Web Scraping with Python
Web scraping is the process of extracting data from websites automatically. Python provides several powerful libraries for web scraping, such as BeautifulSoup, requests, and Selenium.
When to Use Web Scraping
Web scraping is useful for:
- Data Collection: Gathering information for analysis (e.g., stock prices, news articles).
- Monitoring Websites: Tracking changes on competitors’ sites.
- Automation: Extracting repetitive data without manual effort.
- Data Aggregation: Collecting data from multiple sources into a single repository.
Important: Always check a website’s robots.txt file and terms of service before scraping.
Setting Up the Environment
Install the required libraries:
pip install requests beautifulsoup4 lxml
Using requests to Fetch Web Pages
The requests library is used to fetch the HTML content of web pages over HTTP.
import requests
url = "https://example.com"
response = requests.get(url)
if response.status_code == 200:
    print(response.text[:500])  # Print the first 500 characters of the page
else:
    print(f"Failed to retrieve page, status code: {response.status_code}")
Handling Common Issues:
- Headers: Some websites block requests that arrive without headers.
  headers = {"User-Agent": "Mozilla/5.0"}
  response = requests.get(url, headers=headers)
- Timeouts: Avoid hanging requests.
  response = requests.get(url, timeout=10)
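Putting these two options together, here is a minimal sketch with basic error handling; the URL is a placeholder and the header value is just a common example, not a requirement of any particular site.
import requests

url = "https://example.com"  # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
    print(response.text[:500])
except requests.exceptions.RequestException as exc:
    print(f"Request failed: {exc}")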
Parsing HTML with BeautifulSoup
BeautifulSoup helps extract data from HTML content.
from bs4 import BeautifulSoup
html_content = "<html><body><h1>Hello, World!</h1></body></html>"
soup = BeautifulSoup(html_content, "html.parser")
print(soup.h1.text) # Output: Hello, World!
Extracting Elements
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
# Find elements
title = soup.find("title").text
all_links = soup.find_all("a")
print("Page Title:", title)
for link in all_links:
    print("Link:", link.get("href"))
Navigating the HTML Structure
BeautifulSoup provides methods to navigate through the HTML structure:
- soup.title.text – Gets the text of the <title> tag.
- soup.find("div", class_="content") – Finds a <div> with a specific class.
- soup.find_all("p") – Finds all paragraph tags.
Example:
content = soup.find("div", {"id": "main-content"})
paragraphs = content.find_all("p")
for p in paragraphs:
    print(p.text)
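BeautifulSoup also understands CSS selectors through select(), which can express the same lookup more compactly. This sketch assumes the same soup object and the same "main-content" id used in the example above.
# CSS-selector equivalent of the find()/find_all() example above
for p in soup.select("div#main-content p"):
    print(p.text)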
Handling Pagination
Many websites display data across multiple pages. You can scrape multiple pages using loops.
base_url = "https://example.com/page="
for page in range(1, 6):
    url = f"{base_url}{page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.text)
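In practice it is polite to pause between page requests and to accumulate the results as you go. A minimal sketch, assuming the same placeholder URL pattern:
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/page="  # placeholder URL pattern
titles = []

for page in range(1, 6):
    response = requests.get(f"{base_url}{page}", timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    titles.append(soup.title.text)
    time.sleep(1)  # small delay to avoid overloading the server

print(titles)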
Scraping Tables
Web pages often contain structured data in tables.
import pandas as pd
table = soup.find("table")  # assumes the page contains a <table>; find() returns None otherwise
rows = table.find_all("tr")
data = []
for row in rows:
    cols = row.find_all("td")
    if cols:  # skip header rows, which use <th> instead of <td>
        data.append([col.text.strip() for col in cols])
df = pd.DataFrame(data, columns=["Column1", "Column2"])
print(df)
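As an alternative, pandas can parse HTML tables directly (it uses lxml, which was installed in the setup step). This is a sketch assuming response holds a page that actually contains at least one table:
from io import StringIO

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found in the HTML
tables = pd.read_html(StringIO(response.text))
print(tables[0].head())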
Web Scraping with Selenium
For dynamic websites that rely on JavaScript, Selenium can be used to interact with the page.
Installation:
pip install selenium webdriver-manager
Example:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")
print(driver.title)
element = driver.find_element(By.TAG_NAME, "h1")
print(element.text)
driver.quit()
Interacting with Elements:
- Clicking buttons: element.click()
- Typing into fields: element.send_keys("Python")
- Submitting forms: element.submit()
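As a rough sketch of how these calls fit together, the following fills in and submits a search form. The URL and the field name "q" are placeholder assumptions, not taken from any specific site.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com/search")        # placeholder URL
search_box = driver.find_element(By.NAME, "q")  # hypothetical field name
search_box.send_keys("Python")                  # type a query into the field
search_box.submit()                             # submit the enclosing form
print(driver.title)
driver.quit()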
Handling Common Challenges
- Dynamic Content: Use Selenium and wait for JavaScript-rendered elements to appear (see the explicit-wait sketch below).
- CAPTCHAs: Some sites block bots with CAPTCHAs.
- IP Blocking: Rotate proxies or use a delay between requests.
- Session Handling: Use cookies and authentication tokens.
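For dynamic content, Selenium's explicit waits pause until an element is actually present instead of sleeping for a fixed time. A minimal sketch, assuming a driver created as in the Selenium example above and a hypothetical element id of "results":
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for an element with id="results" to be present
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))
)
print(element.text)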
Ethical Web Scraping Guidelines
- Respect robots.txt: Check if scraping is allowed.
- Avoid Overloading Servers: Use delays between requests.
- Give Credit: Respect data ownership and licensing.
- Don’t Harvest Personal Data: Ensure compliance with regulations like GDPR.
Example of checking robots.txt:
response = requests.get("https://example.com/robots.txt")
print(response.text)
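The standard library can also interpret robots.txt rules rather than just printing them. A small sketch using urllib.robotparser:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# Check whether a given user agent may fetch a specific path
print(parser.can_fetch("*", "https://example.com/some-page"))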
Storing Scraped Data
Once data is extracted, it can be stored in various formats:
- CSV File:
  import csv

  with open("data.csv", "w", newline="") as f:
      writer = csv.writer(f)
      writer.writerow(["Name", "Price"])
      writer.writerow(["Item1", "$10"])
- JSON File:
  import json

  data = {"name": "Item1", "price": "$10"}
  with open("data.json", "w") as f:
      json.dump(data, f)
- Database (SQLite Example):
  import sqlite3

  conn = sqlite3.connect("data.db")
  c = conn.cursor()
  c.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")
  c.execute("INSERT INTO products VALUES ('Item1', '$10')")
  conn.commit()
  conn.close()
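If the scraped rows are already in a pandas DataFrame, as in the table-scraping example, a single call writes them to CSV:
# Assumes df is the DataFrame built in the table-scraping example above
df.to_csv("data.csv", index=False)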
Practice Exercises
- Scrape the titles of articles from a news website.
- Extract product names and prices from an e-commerce website.
- Use Selenium to interact with a login form and scrape data post-login.
- Store scraped data in a CSV file (see the starter sketch below).
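As a starting point for exercises 1 and 4, here is a sketch that fetches a page, collects headline text, and writes it to CSV. The URL and the <h2> selector are placeholder assumptions to adapt to whichever site you choose.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/news"  # placeholder: replace with a real news page
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Assume article titles sit in <h2> tags; adjust the selector for the real site
titles = [h2.text.strip() for h2 in soup.find_all("h2")]

with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title"])
    for title in titles:
        writer.writerow([title])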
Summary
- requests: Fetch web pages.
- BeautifulSoup: Parse HTML and extract data.
- Selenium: Handle dynamic web pages.
- Ethics: Always follow best practices to avoid legal issues.
With these tools, you can automate data extraction and build powerful data-driven applications.
Next Lesson: SQL Basics and Python Integration