Web Scraping with Python
Web scraping is the process of extracting data from websites automatically. Python provides several powerful libraries for web scraping, such as BeautifulSoup, requests, and Selenium.
When to Use Web Scraping
Web scraping is useful for:
- Data Collection: Gathering information for analysis (e.g., stock prices, news articles).
- Monitoring Websites: Tracking changes on competitors’ sites.
- Automation: Extracting repetitive data without manual effort.
- Data Aggregation: Collecting data from multiple sources into a single repository.
Important: Always check a website’s robots.txt file and terms of service before scraping.
Setting Up the Environment
Install the required libraries:
pip install requests beautifulsoup4 lxml
Using requests to Fetch Web Pages
The requests library is used to download web pages as raw HTML.
import requests
url = "https://example.com"
response = requests.get(url)
if response.status_code == 200:
    print(response.text[:500])  # Print the first 500 characters of the page
else:
    print(f"Failed to retrieve page, status code: {response.status_code}")
Handling Common Issues:
- Headers: Some websites block requests that arrive without browser-like headers.
  headers = {"User-Agent": "Mozilla/5.0"}
  response = requests.get(url, headers=headers)
- Timeouts: Set a timeout to avoid hanging requests (a combined sketch follows this list).
  response = requests.get(url, timeout=10)
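Putting both options together with basic error handling gives a more defensive fetch. This is a minimal sketch; the URL is a placeholder:
import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
    html = response.text
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")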
Parsing HTML with BeautifulSoup
BeautifulSoup helps extract data from HTML content.
from bs4 import BeautifulSoup
html_content = "<html><body><h1>Hello, World!</h1></body></html>"
soup = BeautifulSoup(html_content, "html.parser")
print(soup.h1.text) # Output: Hello, World!
Extracting Elements
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
# Find elements
title = soup.find("title").text
all_links = soup.find_all("a")
print("Page Title:", title)
for link in all_links:
    print("Link:", link.get('href'))
Navigating the HTML Structure
BeautifulSoup provides methods to navigate through the HTML structure:
- soup.title.text – Gets the text of the <title> tag.
- soup.find("div", class_="content") – Finds a <div> with a specific class.
- soup.find_all("p") – Finds all paragraph tags.
Example:
content = soup.find("div", {"id": "main-content"})
paragraphs = content.find_all("p")
for p in paragraphs:
    print(p.text)
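BeautifulSoup also supports CSS selectors through select(), which can be a concise alternative to find() and find_all(). The id and class names below are hypothetical, used only for illustration:
# All <p> tags inside the element with id "main-content" (assumed id)
for p in soup.select("#main-content p"):
    print(p.text)

# All links carrying the (hypothetical) class "external"
external_links = soup.select("a.external")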
Handling Pagination
Many websites display data across multiple pages. You can scrape multiple pages using loops.
base_url = "https://example.com/page="
for page in range(1, 6):
    url = f"{base_url}{page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.text)
Scraping Tables
Web pages often contain structured data in tables.
import pandas as pd
table = soup.find("table")
rows = table.find_all("tr")
data = []
for row in rows:
    cols = row.find_all("td")
    if cols:  # skip header rows, which use <th> instead of <td>
        data.append([col.text.strip() for col in cols])
df = pd.DataFrame(data, columns=["Column1", "Column2"])
print(df)
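For simple, well-formed tables, pandas can also parse the HTML directly with read_html(), which returns a list of DataFrames (it relies on a parser such as the lxml package installed earlier). A minimal sketch:
from io import StringIO
import pandas as pd

# read_html() parses every <table> in the HTML and returns a list of DataFrames
tables = pd.read_html(StringIO(response.text))
df = tables[0]  # take the first table on the page
print(df.head())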
Web Scraping with Selenium
For dynamic websites that rely on JavaScript, Selenium can be used to interact with the page.
Installation:
pip install selenium webdriver-manager
Example:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")
print(driver.title)
element = driver.find_element(By.TAG_NAME, "h1")
print(element.text)
driver.quit()
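Pages that render content with JavaScript after the initial load may need an explicit wait, so the script pauses until an element appears. Below is a minimal sketch using WebDriverWait; the element id "results" is a hypothetical assumption:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")

# Wait up to 10 seconds for an element with id "results" (assumed) to appear
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))
)
print(element.text)
driver.quit()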
Interacting with Elements (a combined sketch follows this list):
- Clicking buttons: element.click()
- Typing into fields: element.send_keys("Python")
- Submitting forms: element.submit()
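Putting these together, here is a minimal sketch that types a query into a search field and submits the form. The field name "q" and the URL are assumptions for illustration:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")

# Assumes the page has a text input named "q" (hypothetical)
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("Python")
search_box.submit()  # submits the form that contains the field

print(driver.title)
driver.quit()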
Handling Common Challenges
- Dynamic Content: Use Selenium or an explicit wait for JavaScript-rendered content to load.
- CAPTCHAs: Some sites block bots with CAPTCHAs, which automated scrapers generally cannot bypass.
- IP Blocking: Rotate proxies or add a delay between requests.
- Session Handling: Reuse cookies and authentication tokens across requests (see the sketch after this list).
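Two of these points can be illustrated with a shared requests.Session, which reuses cookies and connection settings across requests, combined with a delay between pages. A minimal sketch with placeholder URLs:
import time
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

urls = [f"https://example.com/page={n}" for n in range(1, 4)]  # placeholder URLs

for url in urls:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests to avoid overloading the server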
Ethical Web Scraping Guidelines
- Respect robots.txt: Check if scraping is allowed.
- Avoid Overloading Servers: Use delays between requests.
- Give Credit: Respect data ownership and licensing.
- Don’t Harvest Personal Data: Ensure compliance with regulations like GDPR.
Example of checking robots.txt:
response = requests.get("https://example.com/robots.txt")
print(response.text)
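For a programmatic check, Python's standard library includes urllib.robotparser, which reports whether a given user agent may fetch a given path. A minimal sketch; the user agent name is a placeholder:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# "MyScraper" is a placeholder user agent name
print(rp.can_fetch("MyScraper", "https://example.com/some-page"))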
Storing Scraped Data
Once data is extracted, it can be stored in various formats:
- CSV File:
  import csv
  with open("data.csv", "w", newline="") as f:
      writer = csv.writer(f)
      writer.writerow(["Name", "Price"])
      writer.writerow(["Item1", "$10"])
- JSON File:
  import json
  data = {"name": "Item1", "price": "$10"}
  with open("data.json", "w") as f:
      json.dump(data, f)
- Database (SQLite Example):
  import sqlite3
  conn = sqlite3.connect("data.db")
  c = conn.cursor()
  c.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")
  c.execute("INSERT INTO products VALUES ('Item1', '$10')")
  conn.commit()
  conn.close()
Practice Exercises
- Scrape the titles of articles from a news website.
- Extract product names and prices from an e-commerce website.
- Use Selenium to interact with a login form and scrape data post-login.
- Store scraped data in a CSV file.
Summary
- requests: Fetch web pages.
- BeautifulSoup: Parse HTML and extract data.
- Selenium: Handle dynamic web pages.
- Ethics: Always follow best practices to avoid legal issues.
With these tools, you can automate data extraction and build powerful data-driven applications.
Next Lesson: SQL Basics and Python Integration