How to Extract Data from Website to Excel Using Python
Manually copying and pasting data from a website into an Excel sheet is a familiar task for many, but it's tedious, error-prone, and simply doesn't scale. Using Python, you can automate this entire process to pull web data directly into a spreadsheet, saving yourself hours of grunt work. This article provides a complete, step-by-step guide on how to extract data from a website and save it to an Excel file using a few powerful Python libraries.
What is Web Scraping and Why Use Python?
Web scraping (or data extraction) is the process of automatically collecting information from websites. Instead of you visiting a site and manually copying the data you need, a script does it for you. This is incredibly useful for gathering product prices, stock market data, real estate listings, contact information, or any other structured data available online.
While you could try to copy this data by hand, the limitations become obvious very quickly. It's time-consuming for more than a few pages, and any updates to the source website mean you have to do it all over again. Python is the go-to language for this task for several key reasons:
- Automation: Write the script once, and you can run it anytime to get the latest data. You can even schedule it to run automatically.
- Rich Libraries: Python has an ecosystem of powerful and easy-to-use libraries specifically designed for web scraping and data manipulation.
- Scalability: Scripts can handle extracting data from hundreds or thousands of pages far more efficiently than any human ever could.
- Flexibility: You have complete control over what data you extract and how you format it, from cleaning messy text to structuring it perfectly for analysis in Excel.
Setting Up Your Environment
Before you can start scraping, you'll need to install three essential Python libraries. If you don't have Python installed, you can download it from the official Python website. Once you have Python, you can install the libraries using a command line tool like Terminal (on macOS/Linux) or Command Prompt/PowerShell (on Windows).
Each of these libraries plays a specific role:
- Requests: This library acts like your web browser, sending an HTTP request to the website's server to fetch its HTML source code.
- Beautiful Soup (bs4): This library parses the raw HTML code returned by Requests, turning it into a structured object that's easy to navigate and search.
- Pandas & openpyxl: Pandas is an indispensable data analysis and manipulation library. We'll use it to organize our scraped data into a clean, table-like structure (a DataFrame) and then use the openpyxl engine to export it into an .xlsx Excel file.
To install them, open your command line and run the following commands one by one:
```bash
pip install requests
pip install beautifulsoup4
pip install pandas
pip install openpyxl
```
With these tools installed, you're ready to start building your scraper.
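If you want to confirm that everything installed correctly before writing any scraping code, a quick throwaway check like the one below (a minimal sketch, not part of the scraper itself) will print each library's version:

```python
# Quick sanity check: if these imports succeed, the environment is ready.
import requests
import bs4
import pandas as pd

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("pandas:", pd.__version__)
```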
A Step-by-Step Guide to Extracting Data
To make this practical, let's build a script that scrapes a classic example: the table of S&P 500 companies from Wikipedia. Our goal is to extract this table and save it as a clean Excel file.
Step 1: Inspect the Target Website
First, you need to understand the HTML structure of the page you want to scrape. Open the S&P 500 companies page on Wikipedia in your browser.
To find the table, right-click on it and select "Inspect" or "Inspect Element." This will open your browser's Developer Tools, showing you the HTML code for the page. As you hover over different parts of the code, the corresponding elements will be highlighted on the page.
In this case, you'll notice the data is contained within a <table> tag. A key piece of information is that this specific table has a unique ID: id="constituents". This is the perfect identifier for telling our script exactly which table on the page to grab.
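If you'd rather confirm the table's id from Python instead of the browser, a small throwaway snippet like this (separate from the final script) can list every table's id and class on the page:

```python
import requests
from bs4 import BeautifulSoup

# List every table's id and class attributes to confirm which one to target.
resp = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(resp.content, "html.parser")
for t in soup.find_all("table"):
    print(t.get("id"), t.get("class"))
```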
Step 2: Fetch the Web Page with Python
Let's start our Python script by using the requests library to download the HTML content of the page. This is like asking a robot to open the webpage and save its source code.
```python
import requests

# The URL of the page we want to scrape
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# Send a GET request to the website
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    html_content = response.content
    print("Successfully fetched the page HTML.")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()
```

This code sends a request to the URL and checks if it was successful. If so, it stores the page's HTML in the html_content variable.
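Some sites respond with an error or block requests that don't identify themselves like a browser. If you run into that, you can pass a custom User-Agent header and a timeout to requests.get(). This is a minimal sketch; the header string below is just an example, not something this Wikipedia page requires:

```python
# Optional: identify the script with a browser-like User-Agent and set a timeout.
headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}
response = requests.get(url, headers=headers, timeout=10)
```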
Step 3: Parse the HTML with Beautiful Soup
Now that we have the raw HTML, we need to make sense of it. This is where Beautiful Soup comes in. We’ll feed it the HTML content and tell it which part we're interested in - our table with the ID constituents.
```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Find the specific table by its ID
table = soup.find('table', {'id': 'constituents'})

if not table:
    print("Could not find the target table.")
    exit()
```

The soup.find() method conveniently finds the first <table> tag that has the attribute id="constituents", giving us exactly the data we need.
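As an aside, pandas can often parse well-formed HTML tables on its own via pandas.read_html(), which can shortcut the manual extraction steps that follow. This is a minimal sketch of that alternative, not the approach used in the rest of this walkthrough, and it assumes the table keeps its id="constituents" and that an HTML parser backend such as lxml or html5lib is installed:

```python
import io
import pandas as pd

# read_html returns a list of DataFrames; attrs narrows it to the table we want.
tables = pd.read_html(io.StringIO(response.text), attrs={"id": "constituents"})
df = tables[0]
```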
Step 4: Extract Header and Row Data
An HTML table is made up of rows (<tr> tags), which contain header cells (<th> tags) and data cells (<td> tags). Our next step is to loop through the table and pull out the text from each cell.
First, we'll grab the headers. Then, we can iterate through each row in the table's body, collecting the data from each cell and storing it. A list of dictionaries is a great way to structure this, where each dictionary represents one row and the column headers are the keys (like 'Symbol': 'MMM').
```python
# Extract the table headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows
rows = []

# Find all table row tags within tbody
for tr in table.find('tbody').find_all('tr')[1:]:
    cells = tr.find_all('td')
    if len(cells) > 0:
        row_data = {}
        # Loop through headers and cells at the same time
        for i, header in enumerate(headers):
            if i < len(cells):  # Ensure we don't go out of bounds
                row_data[header] = cells[i].text.strip()
        rows.append(row_data)

print(f"Extracted {len(rows)} rows of data.")
```

This code extracts the headers, then iterates through each row in the table body, collecting all the cell data into a list of dictionaries.
Step 5: Write the Data to an Excel File Using Pandas
With our data neatly organized in a list of dictionaries, the final step is incredibly simple thanks to pandas. We will convert our list into a pandas DataFrame (which is essentially a structured table) and then use a built-in function to save it as an Excel file.
```python
import pandas as pd

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(rows)

# Define the output Excel file name
file_name = 'sp500_companies.xlsx'

# Save the DataFrame to an Excel file
# The index=False argument prevents pandas from writing the DataFrame index as a column
df.to_excel(file_name, index=False, engine='openpyxl')

print(f"Data successfully extracted and saved to {file_name}")
```

And that's it! When you run this script, it will create an Excel file named sp500_companies.xlsx in the same directory, containing all the data from the website's table.
The Complete Script
Here is the full Python script, combining all the steps from above. You can save this as a .py file (e.g., scraper.py) and run it from your command line.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd


def scrape_sp500_to_excel(url, filename):
    """
    Scrapes the S&P 500 company table from Wikipedia and saves it to an Excel file.
    """
    print("Fetching webpage...")
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return

    print("Parsing HTML content...")
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'id': 'constituents'})
    if not table:
        print("Could not find the target table on the page.")
        return

    print("Extracting headers and rows...")
    # Extract headers from the table's first row of <th> cells
    # (works whether or not the table has an explicit <thead>)
    headers = [th.text.strip() for th in table.find('tr').find_all('th')]

    # Extract rows from the table body (tbody > tr)
    rows_data = []
    for row in table.find('tbody').find_all('tr'):
        cells = row.find_all('td')
        if not cells:
            continue  # Skip header rows often found in tbody
        row_dict = {}
        for index, th in enumerate(headers):
            # Check for index out of bounds if rows are malformed
            if index < len(cells):
                cell_text = cells[index].text.strip()
                row_dict[th] = cell_text
            else:
                row_dict[th] = ""  # Add empty string for missing cells
        rows_data.append(row_dict)

    if not rows_data:
        print("No data rows found.")
        return

    print(f"Found {len(rows_data)} companies. Converting to DataFrame...")
    df = pd.DataFrame(rows_data)

    # Clean the Symbol column by stripping footnote markers like "[1]"
    if 'Symbol' in df.columns:
        df['Symbol'] = df['Symbol'].str.replace(r'\[.*\]', '', regex=True)

    print(f"Saving data to {filename}...")
    df.to_excel(filename, index=False, engine='openpyxl')
    print("Scraping complete. Data saved successfully!")


if __name__ == '__main__':
    WIKI_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    OUTPUT_FILENAME = 'sp500_companies.xlsx'
    scrape_sp500_to_excel(WIKI_URL, OUTPUT_FILENAME)
```

Important Considerations and Best Practices
Before you go scraping everything, keep a few things in mind:
- Be Respectful: Don't spam a website with hundreds of requests per second. This can overload the server and get your IP address blocked. If you're scraping many pages, add small delays between requests (e.g., time.sleep(1)). Always check a website's robots.txt file (e.g., example.com/robots.txt) to see its scraping policies.
- Handle Errors Gracefully: Websites change their layout all the time. Your script might break if an HTML element you're targeting is moved or renamed. Build your scripts with error handling (try-except blocks), as in the sketch after this list, to manage these changes gracefully.
- Dynamic Content: The method above works perfectly for static websites where all the content is in the initial HTML. However, many modern websites load data with JavaScript after the page loads. In those cases, requests might not see the data. For these more advanced cases, you would need tools like Selenium or Playwright, which can control a real web browser to interact with the page.
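Here is a minimal sketch of what the first two practices can look like in code - a polite delay between requests and a try-except around each fetch. The one-second delay and the placeholder URLs are illustrative choices, not requirements:

```python
import time
import requests

# Example list of pages to fetch politely; these URLs are placeholders.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for page_url in urls:
    try:
        response = requests.get(page_url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses
        print(f"Fetched {page_url} ({len(response.content)} bytes)")
    except requests.RequestException as exc:
        # Catches connection errors, timeouts, and bad status codes alike
        print(f"Failed to fetch {page_url}: {exc}")
    time.sleep(1)  # Small delay so we don't hammer the server
```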
Final Thoughts
Automating data extraction from websites with Python is a game-changing skill. By combining Requests, Beautiful Soup, and Pandas, you can turn a tedious manual process into a simple, repeatable script, freeing up your time for more valuable analysis. This method empowers you to gather the data you need directly into the tool you use most: Excel.
While writing Python scripts is powerful, sometimes you need answers without the coding. We built Graphed to eliminate this kind of data wrangling for marketing and sales data. Instead of setting up scrapers or learning APIs for your various platforms (like Google Analytics, Shopify, Facebook Ads), you can connect them to our platform with a few clicks. From there, you just ask questions in plain English - like "Which marketing campaigns are driving the most revenue this quarter?" - and get back live, interactive dashboards instantly.