Web Crawler Architecture: URL Frontier and Politeness Scheduling
A production web crawler is not a simple loop fetching URLs. At scale it becomes a distributed pipeline in which the URL frontier acts as the brain, managing billions of URLs while enforcing politeness constraints. The frontier uses a two-tier architecture: front queues rank URLs by importance and freshness (homepages and news pages within minutes; deep archive pages over weeks), while back queues enforce per-host and per-IP rate limits.
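A minimal sketch of the two-tier idea, assuming one priority score per URL and one back queue per host; the class name, queue counts, and method names below are illustrative, not from any specific crawler:

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

class TwoTierFrontier:
    """Front queues order URLs by priority; back queues hold one FIFO per host."""

    def __init__(self, num_front_queues=8):
        # Front tier: one heap per priority band (0 = most important/freshest).
        self.front = [[] for _ in range(num_front_queues)]
        # Back tier: per-host FIFO queues that the politeness scheduler drains.
        self.back = defaultdict(deque)

    def add(self, url, priority_band, score):
        """Place a discovered URL into its front-queue band, best score first."""
        heapq.heappush(self.front[priority_band], (score, url))

    def promote(self, band):
        """Move the best URL from a front queue into its host's back queue."""
        if not self.front[band]:
            return None
        _, url = heapq.heappop(self.front[band])
        host = urlparse(url).netloc
        self.back[host].append(url)
        return host

    def next_for_host(self, host):
        """Called by the politeness scheduler once the host has capacity."""
        queue = self.back.get(host)
        return queue.popleft() if queue else None
```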
The scheduler picks a front queue by priority (weighted round robin), then selects an eligible back queue whose host has available capacity. Each host/IP gets a token bucket with configurable rates, typically 1 to 2 concurrent requests per host to avoid overwhelming servers. To sustain 1000 queries per second (QPS) globally at 300 millisecond (ms) median latency, Little's law puts roughly 1000 × 0.3 s ≈ 300 requests in flight at any moment; with only 1 to 2 allowed per host plus politeness gaps between consecutive fetches to the same host, that concurrency must be spread across 500 to 1500 different hosts active simultaneously.
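A hedged sketch of one scheduling step, assuming a simple per-host token bucket and approximating weighted round robin with a weighted random choice over front-queue bands; the rates and weights are illustrative, and the frontier API is the one from the sketch above:

```python
import random
import time

class TokenBucket:
    """Per-host token bucket: refill_rate tokens per second, capped at capacity."""

    def __init__(self, refill_rate=1.0, capacity=2.0):
        self.refill_rate = refill_rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self):
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def pick_front_queue(weights):
    """Weighted round robin approximated by a weighted random band choice."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

def schedule_one(frontier, buckets, weights):
    """Promote a URL from a front queue; fetch only if its host has capacity."""
    band = pick_front_queue(weights)
    host = frontier.promote(band)            # illustrative frontier from the sketch above
    if host is None:
        return None
    bucket = buckets.setdefault(host, TokenBucket(refill_rate=1.0, capacity=2.0))
    if bucket.try_acquire():
        return frontier.next_for_host(host)  # URL handed to a fetcher worker
    return None                              # host saturated; retry on a later pass
```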
Adaptive recrawl is critical for freshness. Each URL maintains a change-probability estimate based on its observed history. High-change pages (probability near 1.0) are recrawled in minutes; stable pages (probability near 0.0) wait weeks. Google's Caffeine update in 2010 demonstrated this, delivering 50% fresher results by moving from batch indexing to continuous incremental indexing. News sites see recrawl intervals of minutes; deep, stable content might wait months.
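One common way to drive this, sketched below under the assumption of an exponentially weighted change estimate mapped to an interval on a log scale; the constants, bounds, and function names are illustrative:

```python
import math

MIN_INTERVAL = 5 * 60          # 5 minutes for pages that change almost every visit
MAX_INTERVAL = 60 * 24 * 3600  # ~2 months for pages that essentially never change

def update_change_estimate(prev_estimate, changed, alpha=0.3):
    """Exponentially weighted estimate of P(page changed since the last crawl)."""
    observation = 1.0 if changed else 0.0
    return (1 - alpha) * prev_estimate + alpha * observation

def next_recrawl_interval(change_estimate):
    """Map change probability to an interval: near 1.0 -> minutes, near 0.0 -> months."""
    # Interpolate on a log scale so mid-range probabilities land at hours or days.
    span = math.log(MAX_INTERVAL / MIN_INTERVAL)
    return MIN_INTERVAL * math.exp(span * (1.0 - change_estimate))

# Example: a page observed changed on most recent visits drifts toward a high
# estimate and an interval of hours, while an unchanging page drifts toward weeks.
```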
Failure mode: spider traps are infinite URL spaces created by calendars, session identifiers (IDs), or faceted search. A misconfigured crawler can generate millions of useless URLs from a single site. Mitigation requires per-host URL caps, depth limits, and query-parameter entropy checks to detect generative patterns.
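A minimal sketch of such a guard at enqueue time, assuming illustrative thresholds (the caps and the per-parameter distinct-value heuristic are assumptions, not fixed industry values):

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

MAX_URLS_PER_HOST = 100_000              # illustrative per-host URL cap
MAX_PATH_DEPTH = 12                      # illustrative depth limit
MAX_DISTINCT_VALUES_PER_PARAM = 1_000    # high-entropy params suggest a generative pattern

seen_per_host = defaultdict(int)
param_values = defaultdict(set)          # (host, param) -> distinct values observed

def should_enqueue(url):
    """Reject URLs that look like spider-trap output before they enter the frontier."""
    parts = urlparse(url)
    host = parts.netloc

    if seen_per_host[host] >= MAX_URLS_PER_HOST:
        return False                                       # per-host URL cap
    if len([p for p in parts.path.split("/") if p]) > MAX_PATH_DEPTH:
        return False                                       # depth limit (calendars, nested facets)
    for param, values in parse_qs(parts.query).items():
        bucket = param_values[(host, param)]
        bucket.update(values)
        if len(bucket) > MAX_DISTINCT_VALUES_PER_PARAM:
            return False                                   # entropy check (session IDs, tokens)

    seen_per_host[host] += 1
    return True
```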
💡 Key Takeaways
• Two-tier frontier: front queues for priority ranking, back queues for per-host politeness, with token buckets limiting fetches to 1 to 2 concurrent requests per host
• Achieving 1000 QPS at 300 ms median latency requires 500 to 1500 distinct hosts in flight simultaneously, demonstrating why crawler scale depends on host diversity
• Adaptive recrawl uses change-probability estimates: high-change pages (news, homepages) are recrawled in minutes, stable deep content in weeks to months
• Spider-trap protection: per-host URL caps, depth limits, and query-parameter entropy analysis prevent infinite calendar or session-ID explosions
• Google Caffeine (2010) showed 50% fresher results by moving from batch to continuous incremental crawling with adaptive scheduling
• Per-IP and per-subnet throttles are essential beyond per-host limits because multi-tenant hosting means one IP serves hundreds of domains sharing capacity (see the layered-throttle sketch after this list)
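A minimal sketch of layering per-host, per-IP, and per-subnet limits, reusing the illustrative TokenBucket from the scheduler sketch above; the /24 grouping, rates, and on-the-fly DNS call are assumptions for brevity:

```python
import ipaddress
import socket

def subnet_of(ip, prefix=24):
    """Collapse an IPv4 address to its /24 so co-hosted servers share one budget."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def acquire_all(host, host_buckets, ip_buckets, subnet_buckets):
    """A fetch proceeds only if the host, its IP, and its subnet all have capacity."""
    ip = socket.gethostbyname(host)      # a real crawler resolves and caches this separately
    layers = [
        (host, host_buckets),
        (ip, ip_buckets),
        (subnet_of(ip), subnet_buckets),
    ]
    for key, pool in layers:
        bucket = pool.setdefault(key, TokenBucket(refill_rate=1.0, capacity=2.0))
        if not bucket.try_acquire():
            # Simplification: tokens already spent at earlier layers are not returned;
            # a production scheduler would check all layers before committing.
            return False
    return True
```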
📌 Examples
Googlebot crawls busy sites in bursts of several requests per second (rps) but throttles small origins to under 1 rps, dynamically adjusting based on server response latency and error rates
Common Crawl runs monthly crawls of 2 to 8 billion pages at an average of roughly 1500 pages per second (pages/s), but parallelizes across thousands of workers to finish in days at much higher instantaneous throughput
Bing's host-level load balancing explicitly tracks server health and uses exponential backoff on 429/503 responses, setting the next allowed fetch timestamp from the Retry-After header or by doubling the delay up to a maximum cap (a generic version of this backoff is sketched below)
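A hedged, generic sketch of that backoff rule, not Bing's actual implementation; it assumes Retry-After may carry either delta-seconds or an HTTP-date, and the base delay and one-hour cap are illustrative:

```python
import time
from email.utils import parsedate_to_datetime

BASE_DELAY = 30.0    # seconds; illustrative starting politeness delay
MAX_DELAY = 3600.0   # cap the doubling at one hour

def next_allowed_fetch(status_code, retry_after_header, current_delay):
    """Return (next_allowed_timestamp, new_delay) after a fetch response."""
    now = time.time()
    if status_code in (429, 503):
        if retry_after_header:
            try:
                # Retry-After may be delta-seconds or an HTTP-date.
                delay = float(retry_after_header)
            except ValueError:
                delay = max(0.0, parsedate_to_datetime(retry_after_header).timestamp() - now)
        else:
            delay = min(MAX_DELAY, current_delay * 2)   # exponential backoff with cap
        return now + delay, min(MAX_DELAY, max(delay, BASE_DELAY))
    # Healthy response: reset toward the base politeness delay.
    return now + BASE_DELAY, BASE_DELAY
```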