Distributed Crawler Partitioning: Consistent Hashing and Agent Coordination
Beyond tens of millions of URLs, a single crawler agent cannot manage the frontier or maintain politeness across the diversity of hosts. The solution is to partition the frontier by host (or URL hash) using consistent hashing across many crawler agents. Each agent becomes responsible for a subset of host buckets and maintains its own front queues (priority) and back queues (per-host politeness). Consistent hashing ensures that adding or removing agents causes minimal rehashing: on average, only about K/N URLs are remapped when an agent joins a ring of N agents holding K total URLs.
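A minimal sketch of that lookup, assuming MD5 as the ring hash and hypothetical agent and host names; the tail of the snippet illustrates the minimal-rehash property when a fourth agent joins a three-agent ring.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string (host name or agent id) to a point on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, agents):
        # Each agent owns the arc of the ring ending at its point (clockwise).
        self.points = sorted((ring_hash(a), a) for a in agents)
        self.keys = [p for p, _ in self.points]

    def agent_for(self, host: str) -> str:
        """Return the agent that owns this host's frontier bucket."""
        i = bisect.bisect(self.keys, ring_hash(host)) % len(self.points)
        return self.points[i][1]


# Going from 3 agents to 4 reassigns only the hosts whose hash falls on the
# arc the new agent takes over, roughly a quarter of them, not all of them.
hosts = [f"host-{i}.example.com" for i in range(10_000)]
before = ConsistentHashRing([f"agent-{i}" for i in range(3)])
after = ConsistentHashRing([f"agent-{i}" for i in range(4)])
moved = sum(before.agent_for(h) != after.agent_for(h) for h in hosts)
print(f"{moved} of {len(hosts)} host buckets moved")
```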
Each agent runs thousands of concurrent fetches across many hosts but enforces strict per-host concurrency (one to two requests at a time) using token buckets and next-allowed-fetch timestamps. To achieve internet-scale throughput (tens of thousands of pages per second), you need hundreds to thousands of agents; Google's infrastructure likely runs tens of thousands of crawler agents worldwide, each handling a portion of the host space. The architecture is at-least-once: scheduling and fetching retry on failure, and deduplication ensures eventual idempotence.
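A sketch of the per-host gate an agent could keep beside each back queue, assuming a hypothetical HostPoliteness class, a one-request concurrency cap, and a 2-second crawl delay; the single-slot cap plus next-allowed-fetch timestamp acts as a one-token bucket.

```python
import time
from dataclasses import dataclass


@dataclass
class HostPoliteness:
    """Back-queue state for one host on one agent (illustrative defaults)."""
    min_interval_s: float = 2.0   # assumed crawl delay: at most one fetch per 2 s
    max_concurrent: int = 1       # strict per-host concurrency cap
    in_flight: int = 0
    next_allowed_at: float = 0.0  # next-allowed-fetch timestamp

    def try_acquire(self, now: float) -> bool:
        """Return True if a fetch to this host may start right now."""
        if self.in_flight >= self.max_concurrent or now < self.next_allowed_at:
            return False
        self.in_flight += 1
        return True

    def release(self, now: float) -> None:
        """Call when the fetch finishes; schedule the next allowed fetch."""
        self.in_flight -= 1
        self.next_allowed_at = now + self.min_interval_s


# Each agent keeps one gate per host alongside its back queues.
gates: dict[str, HostPoliteness] = {}
gate = gates.setdefault("example.com", HostPoliteness())
if gate.try_acquire(time.time()):
    try:
        ...  # fetch one URL from example.com's back queue here
    finally:
        gate.release(time.time())
```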
The central trade-off is a centralized versus a sharded frontier. A centralized global priority queue is simpler for best-first crawling (always fetch the highest-priority URL globally) but becomes a bottleneck and a single point of failure at the 100-million-to-10-billion-URL scale. Sharded frontiers scale linearly by adding agents but sacrifice perfect global ordering: an important URL on host A might be queued behind low-priority URLs on host B simply because of politeness constraints. Periodic rebalancing and cross-agent priority hints mitigate this.
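A toy contrast of the two frontier shapes, with made-up priorities and hosts; the static shard_of map stands in for the consistent-hashing lookup above.

```python
import heapq

# Newly discovered URLs as (priority, url); lower number = more important.
discovered = [
    (1, "https://a.example/breaking-news"),
    (2, "https://a.example/follow-up"),
    (5, "https://b.example/archive/page-1"),
    (9, "https://b.example/archive/page-2"),
]

# Centralized frontier: one global heap gives perfect best-first order,
# but every agent contends on it and it is a single point of failure.
global_heap = list(discovered)
heapq.heapify(global_heap)

# Sharded frontier: each agent heapifies only the hosts it owns, so ordering
# is best-first only within a shard. agent-1 may fetch priority-5 work while
# agent-0's priority-2 URL waits behind a.example's politeness delay.
shard_of = {"a.example": "agent-0", "b.example": "agent-1"}  # ring lookup in practice
shards = {"agent-0": [], "agent-1": []}
for prio, url in discovered:
    host = url.split("/")[2]
    heapq.heappush(shards[shard_of[host]], (prio, url))
```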
Failure mode: hot partitions. A celebrity user or viral content event can cause one URL bucket (and thus one agent) to receive 80% of newly discovered URLs. That agent's frontier depth explodes, its back queues for popular hosts saturate, and overall system latency spikes. For example, p99 latency might jump from 10 milliseconds (ms) to 500 ms as the one agent struggles. Mitigations include dynamic rebalancing (split hot buckets), overflow queues (route excess URLs to other agents), and virtual nodes in consistent hashing (map each physical agent to multiple points on the hash ring for better load distribution).
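A sketch of the virtual-node variant of the ring, with an assumed 64 virtual nodes per agent and illustrative names.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class VirtualNodeRing:
    """Hash ring where each physical agent owns many small arcs."""

    def __init__(self, agents, vnodes_per_agent: int = 64):
        self.points = sorted(
            (ring_hash(f"{agent}#vnode-{i}"), agent)
            for agent in agents
            for i in range(vnodes_per_agent)
        )
        self.keys = [p for p, _ in self.points]

    def agent_for(self, host: str) -> str:
        i = bisect.bisect(self.keys, ring_hash(host)) % len(self.points)
        return self.points[i][1]


# With 64 points per agent, each agent owns 64 scattered slivers of the host
# space instead of one contiguous arc, so a hot region of the ring is split
# across several agents rather than landing entirely on one.
ring = VirtualNodeRing([f"agent-{i}" for i in range(8)])
print(ring.agent_for("popular-site.example.com"))
```

Virtual nodes even out skew across hash arcs; a single extremely hot host still maps to exactly one agent, which is why bucket splitting and overflow queues remain necessary companions.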
💡 Key Takeaways
• Consistent hashing partitions the frontier by host hash across agents; adding an agent to a ring of N agents remaps only about K/N URLs (minimal rehash), enabling linear scaling
• Each agent maintains its own front and back queues, enforcing 1 to 2 concurrent fetches per host; internet scale (tens of thousands of pages/s) requires hundreds to thousands of agents
• At-least-once semantics: scheduling and fetching retry on failure, deduplication ensures idempotence; consistency is traded for availability and partition tolerance
• A centralized frontier is simpler for best-first crawling (global priority order) but bottlenecks at 100 million to 10 billion URLs; a sharded frontier scales linearly but sacrifices perfect ordering
• Hot partition failure: celebrity or viral content causes one agent to receive 80% of URLs, its frontier depth explodes, and p99 latency spikes from 10 ms to 500 ms as that agent saturates
• Mitigation strategies: dynamic rebalancing (split hot buckets), virtual nodes (map each physical agent to multiple hash ring points), overflow queues (route excess to other agents)
📌 Examples
Google's likely architecture: tens of thousands of crawler agents globally, each responsible for a set of host hash buckets, with periodic rebalancing and cross-datacenter frontier synchronization
Common Crawl's distributed fetch: thousands of workers across cloud regions, each assigned URL ranges via consistent hashing, with central coordination only for dedup and archive storage
Hot partition scenario: a trending Twitter topic causes 10 million new URLs to be discovered for one domain within an hour; if that domain hashes to a single agent, its frontier jumps from 1 million to 11 million URLs, taking roughly 6 hours to drain at normal throughput
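Worked arithmetic for that drain estimate: 11 million URLs over about 6 hours implies a sustained rate of roughly 11,000,000 ÷ 21,600 s ≈ 510 URLs per second for that single agent, which is the "normal throughput" the example assumes.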