# Scenario 08: Web Scraping **Target Audience**: Scraper developers, data engineers **Difficulty**: Intermediate **Keywords**: web scraping, beautifulsoup, lxml, aiohttp, httpx, concurrent fetching --- ## 📋 The Problem Web scraping has two distinct phases with different needs: **Phase 1: Fetching (I/O bound)** - Download HTML from many URLs - Network-bound operation - Perfect for async/await - Want high concurrency **Phase 2: Parsing (CPU bound)** - Parse HTML with BeautifulSoup/lxml - CPU-intensive operation - Synchronous libraries - Blocks event loop if not careful **Traditional approaches**: **All Sync**: - Simple to write - Sequential fetching (slow) - Can't exploit concurrency - Wastes time waiting **Manual Async**: - Async fetch with aiohttp - Manual threading for parsing - Complex coordination - Error handling tricky - Boilerplate heavy **The dilemma**: Need async for fetching, sync for parsing. --- ## 💡 Solution with SmartAsync **Natural separation of concerns**: - Async fetch methods (network I/O) - Sync parsing methods (CPU work) - SmartAsync handles coordination - Clean, maintainable code - Automatic thread offloading **Pattern**: ``` Scraper ├─→ fetch_url() - async (concurrent) ├─→ fetch_many() - async (concurrent) ├─→ parse_html() - sync (CPU-bound) └─→ extract_data() - sync (BeautifulSoup) ``` Framework coordinates async fetching + sync parsing automatically. --- ## 🎯 When to Use **Ideal for**: - Multi-page scraping - Concurrent URL fetching - BeautifulSoup/lxml parsing - Data extraction pipelines - Content aggregation - Price monitoring - Social media scraping **Perfect when**: - Need to fetch 100+ pages - Parsing is CPU-intensive - Want clean code separation - Using sync parsing libraries - Building scraping framework --- ## ⚠️ Considerations ### Design Patterns **Concurrent fetching**: - Use asyncio.gather() for multiple URLs - Respect rate limits - Handle retries gracefully - Connection pooling **Parsing strategy**: - Parse in threads (don't block loop) - Consider batch parsing - Reuse BeautifulSoup parsers - Cache parsed results **Error handling**: - Network errors (timeouts, 404s) - Parse errors (malformed HTML) - Rate limiting (429 responses) - Graceful degradation ### Performance Optimization **Fetching**: - Connection pooling (httpx.AsyncClient) - Concurrent requests limit - Adaptive rate limiting - DNS caching **Parsing**: - Choose right parser (lxml vs html.parser) - Parse only what you need - Stream large documents - Consider parallel parsing **Memory management**: - Don't keep all HTML in memory - Process and discard - Use generators - Monitor memory usage ### Ethical Considerations **Best practices**: - Respect robots.txt - Implement rate limiting - Use reasonable concurrency - Set User-Agent header - Cache responses - Don't hammer servers **Legal considerations**: - Check terms of service - Respect copyright - Follow data protection laws - Rate limiting requirements ### When NOT to Use **Avoid if**: - Scraping single page (no concurrency benefit) - Site has official API (use that instead) - JavaScript-heavy site (need browser automation) - Real-time scraping (need websockets) **Better alternatives**: - **Scrapy** - Full-featured framework - **Playwright/Selenium** - Browser automation - **Official APIs** - Always prefer if available --- ## 🔗 Related Scenarios - **01: CLI Tools** - Scraper as CLI tool - **08: Interactive Environments** - Prototyping scrapers in Jupyter - **06: Plugin Systems** - Scraping pipelines with plugins --- ## 📚 Technology Stack **Fetching libraries**: - **httpx** - Modern async HTTP client - **aiohttp** - Popular async HTTP - **requests** - Traditional sync (not ideal) **Parsing libraries**: - **BeautifulSoup** - Easy, flexible - **lxml** - Fast, powerful - **html5lib** - Strict HTML5 parsing - **selectolax** - Very fast parser **Frameworks**: - **Scrapy** - Full scraping framework - **newspaper3k** - Article extraction - **trafilatura** - Content extraction --- ## 🎯 Scraper Architecture **Layered design**: ``` 1. Transport Layer (async) - HTTP client - Connection pooling - Rate limiting 2. Fetching Layer (async) - URL management - Retry logic - Error handling 3. Parsing Layer (sync + threading) - HTML parsing - Data extraction - Validation 4. Storage Layer (async or sync) - Database writes - File storage - Caching ``` SmartAsync bridges layers 2-3 automatically. --- ## 🔍 Common Challenges **Rate limiting**: - Implement delays between requests - Respect Retry-After headers - Exponential backoff - Per-domain rate limits **Session management**: - Login flows - Cookie handling - CSRF tokens - Session persistence **Anti-scraping measures**: - CAPTCHA detection - IP blocking - JavaScript challenges - User-Agent checking **Solutions**: - Rotate User-Agents - Use proxy rotation - Implement delays - Handle CAPTCHAs appropriately --- ## 📊 Performance Expectations **Sync scraping**: - 1 page/second (sequential) - Limited by network latency - Simple code **Async scraping** (with SmartAsync): - 10-50 pages/second (concurrent) - Limited by bandwidth and rate limits - Clean code - Automatic thread offloading for parsing **Scrapy** (comparison): - 50-100+ pages/second - More complex setup - Full framework --- ## 🎯 Success Metrics Your scraper is successful when: - High concurrency without complexity - Clean separation of fetch/parse - Respects site resources - Robust error handling - Good performance - Easy to maintain --- **Next Steps**: - See [01: Sync App → Async Libraries](01-sync-app-async-libs.md) for CLI scraper tools - Check [09: Interactive Environments](09-interactive-environments.md) for prototyping