How to Extract Data from Website to Excel Using Python
Manually copying and pasting data from a website into an Excel sheet is a familiar task for many, but it's tedious, error-prone, and simply doesn't scale. Using Python, you can automate this entire process to pull web data directly into a spreadsheet, saving yourself hours of grunt work. This article provides a complete, step-by-step guide on how to extract data from a website and save it to an Excel file using a few powerful Python libraries.
What is Web Scraping and Why Use Python?
Web scraping (or data extraction) is the process of automatically collecting information from websites. Instead of you visiting a site and manually copying the data you need, a script does it for you. This is incredibly useful for gathering product prices, stock market data, real estate listings, contact information, or any other structured data available online.
While you could try to copy this data by hand, the limitations become obvious very quickly. It's time-consuming for more than a few pages, and any updates to the source website mean you have to do it all over again. Python is the go-to language for this task for several key reasons:
- Automation: Write the script once, and you can run it anytime to get the latest data. You can even schedule it to run automatically.
- Rich Libraries: Python has an ecosystem of powerful and easy-to-use libraries specifically designed for web scraping and data manipulation.
- Scalability: Scripts can handle extracting data from hundreds or thousands of pages far more efficiently than any human ever could.
- Flexibility: You have complete control over what data you extract and how you format it, from cleaning messy text to structuring it perfectly for analysis in Excel.
Setting Up Your Environment
Before you can start scraping, you'll need to install three essential Python libraries. If you don't have Python installed, you can download it from the official Python website. Once you have Python, you can install the libraries using a command line tool like Terminal (on macOS/Linux) or Command Prompt/PowerShell (on Windows).
Each of these libraries plays a specific role:
- Requests: This library acts like your web browser, sending an HTTP request to the website's server to fetch its HTML source code.
- Beautiful Soup (bs4): This library parses the raw HTML code returned by Requests, turning it into a structured object that's easy to navigate and search.
- Pandas & openpyxl: Pandas is an indispensable data analysis and manipulation library. We'll use it to organize our scraped data into a clean, table-like structure (a DataFrame) and then use the openpyxl engine to export it into an .xlsx Excel file.
To install them, open your command line and run the following commands one by one:
```bash
pip install requests
pip install beautifulsoup4
pip install pandas
pip install openpyxl
```
With these tools installed, you're ready to start building your scraper.
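If you want to confirm that everything installed correctly before writing any scraping code, a quick throwaway check like the one below (a minimal sketch, not part of the scraper itself) will print each library's version:

```python
# Quick sanity check: if these imports succeed, the environment is ready.
import requests
import bs4
import pandas as pd

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("pandas:", pd.__version__)
```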
A Step-by-Step Guide to Extracting Data
To make this practical, let's build a script that scrapes a classic example: the table of S&P 500 companies from Wikipedia. Our goal is to extract this table and save it as a clean Excel file.
Step 1: Inspect the Target Website
First, you need to understand the HTML structure of the page you want to scrape. Open the S&P 500 companies page on Wikipedia in your browser.
To find the table, right-click on it and select "Inspect" or "Inspect Element." This will open your browser's Developer Tools, showing you the HTML code for the page. As you hover over different parts of the code, the corresponding elements will be highlighted on the page.
In this case, you'll notice the data is contained within a <table> tag. A key piece of information is that this specific table has a unique ID: id="constituents". This is the perfect identifier for telling our script exactly which table on the page to grab.
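If you'd rather confirm the table's id from Python instead of the browser, a small throwaway snippet like this (separate from the final script) can list every table's id and class on the page:

```python
import requests
from bs4 import BeautifulSoup

# List every table's id and class attributes to confirm which one to target.
resp = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(resp.content, "html.parser")
for t in soup.find_all("table"):
    print(t.get("id"), t.get("class"))
```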
Step 2: Fetch the Web Page with Python
Let's start our Python script by using the requests library to download the HTML content of the page. This is like asking a robot to open the webpage and save its source code.
```python
import requests

# The URL of the page we want to scrape
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# Send a GET request to the website
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    html_content = response.content
    print("Successfully fetched the page HTML.")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()
```

This code sends a request to the URL and checks if it was successful. If so, it stores the page's HTML in the html_content variable.
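Some sites respond with an error or block requests that don't identify themselves like a browser. If you run into that, you can pass a custom User-Agent header and a timeout to requests.get(). This is a minimal sketch; the header string below is just an example, not something this Wikipedia page requires:

```python
# Optional: identify the script with a browser-like User-Agent and set a timeout.
headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}
response = requests.get(url, headers=headers, timeout=10)
```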
Step 3: Parse the HTML with Beautiful Soup
Now that we have the raw HTML, we need to make sense of it. This is where Beautiful Soup comes in. We’ll feed it the HTML content and tell it which part we're interested in - our table with the ID constituents.
```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Find the specific table by its ID
table = soup.find('table', {'id': 'constituents'})

if not table:
    print("Could not find the target table.")
    exit()
```

The soup.find() method conveniently finds the first <table> tag that has the attribute id="constituents", giving us exactly the data we need.
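As an aside, pandas can often parse well-formed HTML tables on its own via pandas.read_html(), which can shortcut the manual extraction steps that follow. This is a minimal sketch of that alternative, not the approach used in the rest of this walkthrough, and it assumes the table keeps its id="constituents" and that an HTML parser backend such as lxml or html5lib is installed:

```python
import io
import pandas as pd

# read_html returns a list of DataFrames; attrs narrows it to the table we want.
tables = pd.read_html(io.StringIO(response.text), attrs={"id": "constituents"})
df = tables[0]
```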
Step 4: Extract Header and Row Data
An HTML table is made up of rows (<tr> tags), which contain header cells (<th> tags) and data cells (<td> tags). Our next step is to loop through the table and pull out the text from each cell.
First, we'll grab the headers. Then, we can iterate through each row in the table's body, collecting the data from each cell and storing it. A list of dictionaries is a great way to structure this, where each dictionary represents one row and the column headers are the keys (like 'Symbol': 'MMM').
```python
# Extract the table headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows
rows = []

# Find all table row tags within tbody
for tr in table.find('tbody').find_all('tr')[1:]:
    cells = tr.find_all('td')
    if len(cells) > 0:
        row_data = {}
        # Loop through headers and cells at the same time
        for i, header in enumerate(headers):
            if i < len(cells):  # Ensure we don't go out of bounds
                row_data[header] = cells[i].text.strip()
        rows.append(row_data)

print(f"Extracted {len(rows)} rows of data.")
```

This code extracts the headers, then iterates through each row in the table body, collecting all the cell data into a list of dictionaries.
Step 5: Write the Data to an Excel File Using Pandas
With our data neatly organized in a list of dictionaries, the final step is incredibly simple thanks to pandas. We will convert our list into a pandas DataFrame (which is essentially a structured table) and then use a built-in function to save it as an Excel file.
```python
import pandas as pd

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(rows)

# Define the output Excel file name
file_name = 'sp500_companies.xlsx'

# Save the DataFrame to an Excel file
# The index=False argument prevents pandas from writing the DataFrame index as a column
df.to_excel(file_name, index=False, engine='openpyxl')

print(f"Data successfully extracted and saved to {file_name}")
```

And that's it! When you run this script, it will create an Excel file named sp500_companies.xlsx in the same directory, containing all the data from the website's table.
The Complete Script
Here is the full Python script, combining all the steps from above. You can save this as a .py file (e.g., scraper.py) and run it from your command line.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd


def scrape_sp500_to_excel(url, filename):
    """
    Scrapes the S&P 500 company table from Wikipedia and saves it to an Excel file.
    """
    print("Fetching webpage...")
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return

    print("Parsing HTML content...")
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'id': 'constituents'})
    if not table:
        print("Could not find the target table on the page.")
        return

    print("Extracting headers and rows...")
    # Extract headers from the table's first row of <th> cells
    # (works whether or not the table has an explicit <thead>)
    headers = [th.text.strip() for th in table.find('tr').find_all('th')]

    # Extract rows from the table body (tbody > tr)
    rows_data = []
    for row in table.find('tbody').find_all('tr'):
        cells = row.find_all('td')
        if not cells:
            continue  # Skip header rows often found in tbody
        row_dict = {}
        for index, th in enumerate(headers):
            # Check for index out of bounds if rows are malformed
            if index < len(cells):
                cell_text = cells[index].text.strip()
                row_dict[th] = cell_text
            else:
                row_dict[th] = ""  # Add empty string for missing cells
        rows_data.append(row_dict)

    if not rows_data:
        print("No data rows found.")
        return

    print(f"Found {len(rows_data)} companies. Converting to DataFrame...")
    df = pd.DataFrame(rows_data)

    # Clean the Symbol column by stripping footnote markers like "[1]"
    if 'Symbol' in df.columns:
        df['Symbol'] = df['Symbol'].str.replace(r'\[.*\]', '', regex=True)

    print(f"Saving data to {filename}...")
    df.to_excel(filename, index=False, engine='openpyxl')
    print("Scraping complete. Data saved successfully!")


if __name__ == '__main__':
    WIKI_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    OUTPUT_FILENAME = 'sp500_companies.xlsx'
    scrape_sp500_to_excel(WIKI_URL, OUTPUT_FILENAME)
```

Important Considerations and Best Practices
Before you go scraping everything, keep a few things in mind:
- Be Respectful: Don't spam a website with hundreds of requests per second. This can overload the server and get your IP address blocked. If you're scraping many pages, add small delays between requests (e.g., time.sleep(1)). Always check a website's robots.txt file (e.g., example.com/robots.txt) to see its scraping policies.
- Handle Errors Gracefully: Websites change their layout all the time. Your script might break if an HTML element you're targeting is moved or renamed. Build your scripts with error handling (try-except blocks), as in the sketch after this list, to manage these changes gracefully.
- Dynamic Content: The method above works perfectly for static websites where all the content is in the initial HTML. However, many modern websites load data with JavaScript after the page loads. In those cases, requests might not see the data. For these more advanced cases, you would need tools like Selenium or Playwright, which can control a real web browser to interact with the page.
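Here is a minimal sketch of what the first two practices can look like in code - a polite delay between requests and a try-except around each fetch. The one-second delay and the placeholder URLs are illustrative choices, not requirements:

```python
import time
import requests

# Example list of pages to fetch politely; these URLs are placeholders.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for page_url in urls:
    try:
        response = requests.get(page_url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses
        print(f"Fetched {page_url} ({len(response.content)} bytes)")
    except requests.RequestException as exc:
        # Catches connection errors, timeouts, and bad status codes alike
        print(f"Failed to fetch {page_url}: {exc}")
    time.sleep(1)  # Small delay so we don't hammer the server
```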
Final Thoughts
Automating data extraction from websites with Python is a game-changing skill. By combining Requests, Beautiful Soup, and Pandas, you can turn a tedious manual process into a simple, repeatable script, freeing up your time for more valuable analysis. This method empowers you to gather the data you need directly into the tool you use most: Excel.
While writing Python scripts is powerful, sometimes you need answers without the coding. We built Graphed to eliminate this kind of data wrangling for marketing and sales data. Instead of setting up scrapers or learning APIs for your various platforms (like Google Analytics, Shopify, Facebook Ads), you can connect them to our platform with a few clicks. From there, you just ask questions in plain English - like "Which marketing campaigns are driving the most revenue this quarter?" - and get back live, interactive dashboards instantly.