Scenario 08: Web Scraping
Target Audience: Scraper developers, data engineers Difficulty: Intermediate Keywords: web scraping, beautifulsoup, lxml, aiohttp, httpx, concurrent fetching
📋 The Problem
Web scraping has two distinct phases with different needs:
Phase 1: Fetching (I/O bound)
Download HTML from many URLs
Network-bound operation
Perfect for async/await
Want high concurrency
Phase 2: Parsing (CPU bound)
Parse HTML with BeautifulSoup/lxml
CPU-intensive operation
Synchronous libraries
Blocks event loop if not careful
Traditional approaches:
All Sync:
Simple to write
Sequential fetching (slow)
Can’t exploit concurrency
Wastes time waiting
Manual Async:
Async fetch with aiohttp
Manual threading for parsing
Complex coordination
Error handling tricky
Boilerplate heavy
The dilemma: Need async for fetching, sync for parsing.
💡 Solution with SmartAsync
Natural separation of concerns:
Async fetch methods (network I/O)
Sync parsing methods (CPU work)
SmartAsync handles coordination
Clean, maintainable code
Automatic thread offloading
Pattern:
Scraper
├─→ fetch_url() - async (concurrent)
├─→ fetch_many() - async (concurrent)
├─→ parse_html() - sync (CPU-bound)
└─→ extract_data() - sync (BeautifulSoup)
Framework coordinates async fetching + sync parsing automatically.
🎯 When to Use
Ideal for:
Multi-page scraping
Concurrent URL fetching
BeautifulSoup/lxml parsing
Data extraction pipelines
Content aggregation
Price monitoring
Social media scraping
Perfect when:
Need to fetch 100+ pages
Parsing is CPU-intensive
Want clean code separation
Using sync parsing libraries
Building scraping framework
⚠️ Considerations
Design Patterns
Concurrent fetching:
Use asyncio.gather() for multiple URLs
Respect rate limits
Handle retries gracefully
Connection pooling
Parsing strategy:
Parse in threads (don’t block loop)
Consider batch parsing
Reuse BeautifulSoup parsers
Cache parsed results
Error handling:
Network errors (timeouts, 404s)
Parse errors (malformed HTML)
Rate limiting (429 responses)
Graceful degradation
Performance Optimization
Fetching:
Connection pooling (httpx.AsyncClient)
Concurrent requests limit
Adaptive rate limiting
DNS caching
Parsing:
Choose right parser (lxml vs html.parser)
Parse only what you need
Stream large documents
Consider parallel parsing
Memory management:
Don’t keep all HTML in memory
Process and discard
Use generators
Monitor memory usage
Ethical Considerations
Best practices:
Respect robots.txt
Implement rate limiting
Use reasonable concurrency
Set User-Agent header
Cache responses
Don’t hammer servers
Legal considerations:
Check terms of service
Respect copyright
Follow data protection laws
Rate limiting requirements
When NOT to Use
Avoid if:
Scraping single page (no concurrency benefit)
Site has official API (use that instead)
JavaScript-heavy site (need browser automation)
Real-time scraping (need websockets)
Better alternatives:
Scrapy - Full-featured framework
Playwright/Selenium - Browser automation
Official APIs - Always prefer if available
📚 Technology Stack
Fetching libraries:
httpx - Modern async HTTP client
aiohttp - Popular async HTTP
requests - Traditional sync (not ideal)
Parsing libraries:
BeautifulSoup - Easy, flexible
lxml - Fast, powerful
html5lib - Strict HTML5 parsing
selectolax - Very fast parser
Frameworks:
Scrapy - Full scraping framework
newspaper3k - Article extraction
trafilatura - Content extraction
🎯 Scraper Architecture
Layered design:
1. Transport Layer (async)
- HTTP client
- Connection pooling
- Rate limiting
2. Fetching Layer (async)
- URL management
- Retry logic
- Error handling
3. Parsing Layer (sync + threading)
- HTML parsing
- Data extraction
- Validation
4. Storage Layer (async or sync)
- Database writes
- File storage
- Caching
SmartAsync bridges layers 2-3 automatically.
🔍 Common Challenges
Rate limiting:
Implement delays between requests
Respect Retry-After headers
Exponential backoff
Per-domain rate limits
Session management:
Login flows
Cookie handling
CSRF tokens
Session persistence
Anti-scraping measures:
CAPTCHA detection
IP blocking
JavaScript challenges
User-Agent checking
Solutions:
Rotate User-Agents
Use proxy rotation
Implement delays
Handle CAPTCHAs appropriately
📊 Performance Expectations
Sync scraping:
1 page/second (sequential)
Limited by network latency
Simple code
Async scraping (with SmartAsync):
10-50 pages/second (concurrent)
Limited by bandwidth and rate limits
Clean code
Automatic thread offloading for parsing
Scrapy (comparison):
50-100+ pages/second
More complex setup
Full framework
🎯 Success Metrics
Your scraper is successful when:
High concurrency without complexity
Clean separation of fetch/parse
Respects site resources
Robust error handling
Good performance
Easy to maintain
Next Steps:
See 01: Sync App → Async Libraries for CLI scraper tools
Check 09: Interactive Environments for prototyping