Scenario 08: Web Scraping

Target Audience: Scraper developers, data engineers
Difficulty: Intermediate
Keywords: web scraping, beautifulsoup, lxml, aiohttp, httpx, concurrent fetching

📋 The Problem

Web scraping has two distinct phases with different needs:

Phase 1: Fetching (I/O bound)

Download HTML from many URLs
Network-bound operation
Perfect for async/await
Want high concurrency

Phase 2: Parsing (CPU bound)

Parse HTML with BeautifulSoup/lxml
CPU-intensive operation
Synchronous libraries
Blocks the event loop if called directly from a coroutine

Traditional approaches:

All Sync:

Simple to write
Sequential fetching (slow)
Can’t exploit concurrency
Wastes time waiting

Manual Async:

Async fetch with aiohttp
Manual threading for parsing
Complex coordination
Error handling tricky
Boilerplate heavy

The dilemma: Need async for fetching, sync for parsing.

💡 Solution with SmartAsync

Natural separation of concerns:

Async fetch methods (network I/O)
Sync parsing methods (CPU work)
SmartAsync handles coordination
Clean, maintainable code
Automatic thread offloading

Pattern:

Scraper
  ├─→ fetch_url() - async (concurrent)
  ├─→ fetch_many() - async (concurrent)
  ├─→ parse_html() - sync (CPU-bound)
  └─→ extract_data() - sync (BeautifulSoup)

Framework coordinates async fetching + sync parsing automatically.
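
A minimal sketch of the pattern above, under stated assumptions: it uses plain asyncio.to_thread as a stand-in for the thread offloading described here and does not use SmartAsync's own API. FAKE_PAGES, TitleParser, and the scrape() method are hypothetical names added for illustration, and the standard-library html.parser stands in for BeautifulSoup/lxml so the example runs without extra dependencies.

```python
import asyncio
from html.parser import HTMLParser

# Hypothetical canned pages standing in for real HTTP responses.
FAKE_PAGES = {
    "https://example.com/a": "<html><head><title>Page A</title></head></html>",
    "https://example.com/b": "<html><head><title>Page B</title></head></html>",
}


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class Scraper:
    async def fetch_url(self, url):
        # Stand-in for a real async HTTP request (assumption: no network here).
        await asyncio.sleep(0.01)
        return FAKE_PAGES[url]

    async def fetch_many(self, urls):
        # All fetches run concurrently; gather preserves input order.
        return await asyncio.gather(*(self.fetch_url(u) for u in urls))

    def parse_html(self, html):
        # Ordinary synchronous parsing code.
        parser = TitleParser()
        parser.feed(html)
        parser.close()
        return parser.title

    async def scrape(self, urls):
        pages = await self.fetch_many(urls)
        # Each sync parse runs in a worker thread so the event loop stays free.
        titles = await asyncio.gather(
            *(asyncio.to_thread(self.parse_html, page) for page in pages)
        )
        return dict(zip(urls, titles))
```

Calling asyncio.run(Scraper().scrape(list(FAKE_PAGES))) returns a dict mapping each URL to its page title.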

🎯 When to Use

Ideal for:

Multi-page scraping
Concurrent URL fetching
BeautifulSoup/lxml parsing
Data extraction pipelines
Content aggregation
Price monitoring
Social media scraping

Perfect when:

Need to fetch 100+ pages
Parsing is CPU-intensive
Want clean code separation
Using sync parsing libraries
Building scraping framework

⚠️ Considerations

Design Patterns

Concurrent fetching:

Use asyncio.gather() for multiple URLs
Respect rate limits
Handle retries gracefully
Connection pooling
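
One way to combine asyncio.gather() with a concurrency cap is an asyncio.Semaphore. The sketch below is illustrative only: fetch() is a hypothetical stand-in for a real HTTP call, and the peak counter exists just to make the cap observable.

```python
import asyncio


async def fetch(url):
    # Hypothetical stand-in for an HTTP GET; sleeps instead of using the network.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def bounded_fetch(url):
        nonlocal in_flight, peak
        # At most max_concurrency coroutines are inside this block at once.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    # gather schedules every URL, but the semaphore throttles actual fetches.
    pages = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return pages, peak
```

With a real client, the same semaphore would wrap the request call, and a single shared client or session would supply the connection pool.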

Parsing strategy:

Parse in threads (don’t block loop)
Consider batch parsing
Reuse BeautifulSoup parsers
Cache parsed results

Error handling:

Network and HTTP errors (timeouts, connection failures, 404s)
Parse errors (malformed HTML)
Rate limiting (429 responses)
Graceful degradation
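
The error categories above might be handled along these lines. This is a sketch under assumptions, not a definitive implementation: FetchError, SCRIPT, and fetch_once are hypothetical and simulate HTTP statuses without touching the network. Retryable statuses (429 and 5xx) are retried with exponential backoff, a 404 fails immediately, and return_exceptions=True lets the run finish even when some URLs fail.

```python
import asyncio


class FetchError(Exception):
    def __init__(self, url, status):
        super().__init__(f"{url} -> HTTP {status}")
        self.url = url
        self.status = status


# Hypothetical scripted responses: the status each URL returns per attempt.
SCRIPT = {
    "https://example.com/ok": [200],
    "https://example.com/flaky": [429, 200],
    "https://example.com/missing": [404],
}


async def fetch_once(url, attempt):
    await asyncio.sleep(0)  # yield to the event loop like real I/O would
    statuses = SCRIPT[url]
    status = statuses[min(attempt, len(statuses) - 1)]
    if status != 200:
        raise FetchError(url, status)
    return f"body of {url}"


async def fetch_with_retry(url, max_attempts=3, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return await fetch_once(url, attempt)
        except FetchError as exc:
            retryable = exc.status == 429 or exc.status >= 500
            if not retryable or attempt == max_attempts - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)


async def scrape(urls):
    # return_exceptions=True keeps one failure from cancelling the rest.
    results = await asyncio.gather(
        *(fetch_with_retry(u) for u in urls), return_exceptions=True
    )
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, Exception)}
    return ok, failed
```

Running scrape(list(SCRIPT)) yields bodies for the ok and flaky URLs and a FetchError with status 404 for the missing one.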

Performance Optimization

Fetching:

Connection pooling (httpx.AsyncClient)
Concurrent requests limit
Adaptive rate limiting
DNS caching

Parsing:

Choose right parser (lxml vs html.parser)
Parse only what you need
Stream large documents
Consider parallel parsing
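
"Parse only what you need" can be illustrated with a parser that reacts to a single tag and ignores everything else. The sketch uses the standard-library html.parser so it runs without dependencies; LinkExtractor and extract_links are hypothetical names. With BeautifulSoup, passing a SoupStrainer through the parse_only argument serves a similar purpose.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags and skips all other markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```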

Memory management:

Don’t keep all HTML in memory
Process and discard
Use generators
Monitor memory usage
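
A process-and-discard flow can be sketched with an async generator. Everything here is an assumption for illustration: fetch() is a fake, and summarize() stands in for real extraction. Each page is reduced to a small result as soon as it arrives, so the raw HTML does not accumulate in a list.

```python
import asyncio


async def fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


def summarize(html):
    # Placeholder for extraction: keep only a small derived value.
    return len(html)


async def iter_results(urls, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_and_summarize(url):
        async with semaphore:
            html = await fetch(url)
        # Only the summary is returned; the HTML string can be freed after this.
        return url, summarize(html)

    tasks = [asyncio.create_task(fetch_and_summarize(u)) for u in urls]
    # Yield results in completion order instead of collecting them all first.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect(urls):
    # Example consumer; a real one might write each item to storage instead.
    return [item async for item in iter_results(urls)]
```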

Ethical Considerations

Best practices:

Respect robots.txt
Implement rate limiting
Use reasonable concurrency
Set User-Agent header
Cache responses
Don’t hammer servers
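
Checking robots.txt can be done with the standard library's urllib.robotparser. The sketch below parses an inline, made-up robots.txt so it runs offline; ROBOTS_TXT, USER_AGENT, and build_robots are hypothetical names, and a real scraper would load the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-scraper/0.1"  # hypothetical User-Agent string


def build_robots(robots_txt):
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url):
    # True when the rules permit USER_AGENT to fetch this URL.
    return parser.can_fetch(USER_AGENT, url)


robots = build_robots(ROBOTS_TXT)
```

Here allowed(robots, "https://example.com/public/page") is True, allowed(robots, "https://example.com/private/data") is False, and robots.crawl_delay(USER_AGENT) reports the requested delay in seconds.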

Legal considerations:

Check terms of service
Respect copyright
Follow data protection laws
Honor any rate limits the site imposes

When NOT to Use

Avoid if:

Scraping single page (no concurrency benefit)
Site has official API (use that instead)
JavaScript-heavy site (need browser automation)
Real-time data feeds (need WebSockets or streaming, not page fetching)

Better alternatives:

Scrapy - Full-featured framework
Playwright/Selenium - Browser automation
Official APIs - Always prefer if available

📚 Technology Stack

Fetching libraries:

httpx - Modern async HTTP client
aiohttp - Popular async HTTP
requests - Traditional sync client (blocks the event loop unless run in a thread)

Parsing libraries:

BeautifulSoup - Easy, flexible
lxml - Fast, powerful
html5lib - Browser-like HTML5 parsing (very lenient, but slow)
selectolax - Very fast parser

Frameworks:

Scrapy - Full scraping framework
newspaper3k - Article extraction
trafilatura - Content extraction

🎯 Scraper Architecture

Layered design:

1. Transport Layer (async)
   - HTTP client
   - Connection pooling
   - Rate limiting

2. Fetching Layer (async)
   - URL management
   - Retry logic
   - Error handling

3. Parsing Layer (sync + threading)
   - HTML parsing
   - Data extraction
   - Validation

4. Storage Layer (async or sync)
   - Database writes
   - File storage
   - Caching

SmartAsync bridges layers 2-3 automatically.

🔍 Common Challenges

Rate limiting:

Implement delays between requests
Respect Retry-After headers
Exponential backoff
Per-domain rate limits
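
The four points above can be sketched with two small helpers. Both are hypothetical, not part of any library: DomainThrottle enforces a minimum delay between requests to the same host, and retry_delay() prefers a Retry-After header (seconds or HTTP-date) and otherwise falls back to capped exponential backoff.

```python
import asyncio
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._locks = {}
        self._last = {}

    async def wait(self, url):
        host = urlsplit(url).netloc
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:  # serialize callers that target the same host
            last = self._last.get(host)
            if last is not None:
                remaining = self.min_interval - (time.monotonic() - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last[host] = time.monotonic()


def retry_delay(retry_after, attempt, base=1.0, cap=60.0, now=None):
    """Seconds to wait before retrying, never more than cap."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():  # Retry-After given as delay in seconds
            return min(float(value), cap)
        try:  # Retry-After given as an HTTP-date
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return min(max((when - now).total_seconds(), 0.0), cap)
    # No usable header: exponential backoff (base, 2*base, 4*base, ...).
    return min(base * 2 ** attempt, cap)
```

A fetch loop would call await throttle.wait(url) before each request and sleep for retry_delay(...) after a 429 response.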

Session management:

Login flows
Cookie handling
CSRF tokens
Session persistence

Anti-scraping measures:

CAPTCHA detection
IP blocking
JavaScript challenges
User-Agent checking

Solutions:

Rotate User-Agents
Use proxy rotation
Implement delays
Handle CAPTCHAs appropriately

📊 Performance Expectations

Sync scraping:

Roughly 1 page/second (sequential, assuming about 1 second per request)
Limited by network latency
Simple code

Async scraping (with SmartAsync):

Roughly 10-50 pages/second (concurrent, depending on the concurrency limit)
Limited by bandwidth and rate limits
Clean code
Automatic thread offloading for parsing

Scrapy (comparison):

Roughly 50-100+ pages/second
More complex setup
Full framework
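
The gap between sequential and concurrent fetching follows from simple arithmetic: while requests are latency-bound, throughput is roughly the number of requests in flight divided by the per-request latency. At about 1 second per request, one request at a time gives about 1 page/second, and 20 in flight gives up to about 20 pages/second before bandwidth or rate limits take over. The toy measurement below uses a simulated 50 ms latency, so the numbers it prints are not benchmarks; fetch() and timed() are hypothetical helpers.

```python
import asyncio
import time


async def fetch(url, latency=0.05):
    # Simulated network latency; no real request is made.
    await asyncio.sleep(latency)
    return url


async def sequential(urls):
    # One request at a time: total time is about len(urls) * latency.
    return [await fetch(u) for u in urls]


async def concurrent(urls):
    # All requests in flight together: total time is about one latency.
    return await asyncio.gather(*(fetch(u) for u in urls))


def timed(coro_fn, urls):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(urls))
    return result, time.perf_counter() - start
```

Calling timed(sequential, urls) and timed(concurrent, urls) on the same list of ten URLs should show the concurrent run finishing several times faster.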

🎯 Success Metrics

Your scraper is successful when:

High concurrency without complexity
Clean separation of fetch/parse
Respects site resources
Robust error handling
Good performance
Easy to maintain

Next Steps:

See 01: Sync App → Async Libraries for CLI scraper tools
Check 09: Interactive Environments for prototyping