Scale Economics: Storage, Bandwidth, and Operational Cost Models
The economics of web crawling at scale are dominated by bandwidth, storage, and compute costs. At 1000 pages per second with 100 KB average compressed page size, inbound bandwidth is approximately 100 megabytes per second (MB/s) or 0.8 gigabits per second (Gbps). This accrues roughly 8.6 terabytes per day (TB/day) or 260 TB per month of raw compressed content. Cloud object storage typically costs $20 to $25 per TB per month, implying $5,000 to $7,000 per month just for storing one month of raw crawl data, excluding compute and egress.
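The arithmetic above can be checked with a back-of-envelope cost model. All rates and prices are the illustrative figures from the text, not measured values:

```python
# Back-of-envelope crawl cost model using the figures from the text.
PAGES_PER_SEC = 1000
AVG_PAGE_KB = 100            # compressed page size
STORAGE_PER_TB_MONTH = 22.5  # midpoint of the $20-$25/TB/month range

SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30

mb_per_sec = PAGES_PER_SEC * AVG_PAGE_KB / 1000        # ~100 MB/s
gbps = mb_per_sec * 8 / 1000                           # ~0.8 Gbps
tb_per_day = mb_per_sec * SECONDS_PER_DAY / 1_000_000  # ~8.64 TB/day
tb_per_month = tb_per_day * DAYS_PER_MONTH             # ~259 TB/month
storage_cost = tb_per_month * STORAGE_PER_TB_MONTH     # ~$5,800/month

print(f"{mb_per_sec:.0f} MB/s ({gbps:.1f} Gbps)")
print(f"{tb_per_day:.2f} TB/day, {tb_per_month:.0f} TB/month")
print(f"${storage_cost:,.0f}/month raw storage")
```

The $5,000 to $7,000 range in the text corresponds to the $20 and $25 price points applied to roughly 260 TB.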
Common Crawl provides concrete benchmarks: monthly crawls of 2 to 8 billion pages produce 200 to 400 TB of compressed archives per crawl. At 4 billion pages per month, that averages roughly 1500 pages per second sustained, though in practice the work is parallelized across thousands of workers and completes in days at much higher instantaneous throughput. The storage cost alone at $20 per TB is $4,000 to $8,000 per month; compute for fetching, parsing, and dedup adds another $10,000 to $20,000 per month in practice.
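A quick sanity check on the sustained-rate figure:

```python
# 4 billion pages per month averages out to roughly 1500 pages/s sustained.
pages_per_month = 4_000_000_000
seconds_per_month = 30 * 86_400  # ~2.59 million seconds

avg_pps = pages_per_month / seconds_per_month
print(f"{avg_pps:.0f} pages/s sustained")  # ~1543 pages/s

# A crawl that finishes in 10 days instead of 30 needs 3x that rate
# while it runs, which is why instantaneous throughput is much higher.
burst_pps = pages_per_month / (10 * 86_400)
print(f"{burst_pps:.0f} pages/s over a 10-day burst")
```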
Efficiency comes from deduplication and conditional requests. Using validators (ETag with If-None-Match, Last-Modified with If-Modified-Since), production systems skip 20% to 60% of bytes on stable sites by receiving 304 Not Modified responses instead of full bodies. Content-level dedup reclaims another 20% to 40% of storage. Together, these optimizations reduce effective storage and bandwidth costs by 40% to 70%, turning a $15,000 per month operation into a $5,000 to $9,000 per month one at the same nominal scale.
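A minimal sketch of conditional fetching with a per-URL validator cache. `CrawlCache` and the `http_get` callable are illustrative names, not a real library API; `http_get` stands in for whatever HTTP client the crawler uses:

```python
class CrawlCache:
    """Remembers ETag / Last-Modified validators per URL (sketch)."""

    def __init__(self):
        self.validators = {}  # url -> (etag, last_modified)
        self.bodies = {}      # url -> last stored body

    def conditional_headers(self, url):
        etag, last_mod = self.validators.get(url, (None, None))
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_mod:
            headers["If-Modified-Since"] = last_mod
        return headers

    def fetch(self, url, http_get):
        """http_get(url, headers) -> (status, response_headers, body)."""
        status, resp_headers, body = http_get(url, self.conditional_headers(url))
        if status == 304:
            # Server confirmed our copy is current: no body bytes transferred,
            # nothing new to store.
            return self.bodies[url]
        # Full response: remember the new validators and body for next time.
        self.validators[url] = (
            resp_headers.get("ETag"),
            resp_headers.get("Last-Modified"),
        )
        self.bodies[url] = body
        return body
```

On a recrawl of a stable page, the second fetch sends the saved validators and the server can answer 304 with an empty body, which is where the 20% to 60% byte savings come from.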
Rendering adds another dimension: even isolating JS rendering to 10% of pages still requires dedicated render clusters. A 32 core render node at $1,000 per month handles on the order of 30 to 100 rendered pages per second. To render 100 pages per second continuously (10% of 1000 pages/s), you need roughly 1 to 3 render nodes, adding $1,000 to $3,000 per month. Rendering everything would require 10 to 30 render nodes at $10,000 to $30,000 per month, which is why selective rendering is mandatory.
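The node counts fall out of a simple capacity calculation. The per-node capacity range here is the assumption from the text (a 32-core node at $1,000/month handling roughly 30 to 100 rendered pages per second), not a benchmark:

```python
import math

def render_nodes(total_pps, render_fraction, node_capacity_pps):
    """Nodes needed to render a fraction of the crawl at a given per-node rate."""
    return math.ceil(total_pps * render_fraction / node_capacity_pps)

# 10% render budget at 1000 pages/s total:
low  = render_nodes(1000, 0.10, 100)  # optimistic capacity -> 1 node
high = render_nodes(1000, 0.10, 35)   # pessimistic capacity -> 3 nodes
print(low, high)  # 1 3

# Rendering 100% of pages:
print(render_nodes(1000, 1.0, 100), render_nodes(1000, 1.0, 35))  # 10 29
```

At $1,000 per node per month, the 10% budget costs $1,000 to $3,000, while full rendering lands in the $10,000 to $30,000 range the text cites.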
💡 Key Takeaways
•At 1000 pages per second and 100 KB per page compressed, bandwidth is 100 MB per second (0.8 Gbps), accruing 8.6 TB per day or 260 TB per month of raw content
•Cloud object storage at $20 to $25 per TB per month means $5,000 to $7,000 per month for one month of crawl data, before compute and egress costs
•Common Crawl's 4 billion pages per month (1500 pages/s average) produces 200 to 400 TB compressed archives, costing $4,000 to $8,000 per month storage plus $10,000 to $20,000 per month compute
•Conditional requests (If-None-Match with ETag, If-Modified-Since with Last-Modified) skip 20% to 60% of bytes on stable sites via 304 Not Modified, reducing bandwidth and storage by nearly half
•Deduplication reclaims 20% to 40% of storage and indexing I/O; combined with conditional requests, total savings reach 40% to 70% of baseline cost
•Rendering economics: 10% render budget adds $1,000 to $3,000 per month (1 to 3 nodes), rendering 100% would cost $10,000 to $30,000 per month (10 to 30 nodes)
📌 Examples
Production crawler at 1000 pages/s: fetch cluster 5 nodes at $500/month ($2.5K), storage $6K/month, render cluster 2 nodes at $1K/month ($2K), total $10.5K/month baseline
With dedup and conditional requests: storage drops to $3K/month (50% savings), bandwidth reduced 40%, total cost $7K/month (33% savings)
Google's scale: estimated billions of pages per day, implying petabytes of ingest monthly, but aggressive dedup, incremental indexing, and stale cache serving keep marginal cost per page under $0.0001