Resilience & Service Patterns • Web Crawler Design
Rendering Strategy: Balancing HTML Only vs JavaScript Execution
Modern web pages often require JavaScript (JS) execution to render their content, but rendering is 5 to 50 times more expensive than HTML parsing. A single HTML fetch plus parse takes under 100 milliseconds (ms) with a median 30 to 80 kilobytes (KB) transferred. Rendering the same page with JS execution takes 0.5 to 3 seconds of median wall-clock time, requires 512 megabytes (MB) to 1 gigabyte (GB) of memory per instance, and consumes 10 to 100 times more central processing unit (CPU) cycles. At 1000 pages per second (pages/s) of throughput, rendering everything would require 50 to 100 expensive render nodes; HTML only needs a handful of fetch nodes.
The solution is selective rendering gated by heuristics. Check the Multipurpose Internet Mail Extensions (MIME) type (text/html only), detect JS framework signatures (React, Angular, Vue identifiers in the HTML), look for structured data markers such as JSON Linked Data (JSON-LD) and Open Graph, or track historical render necessity per domain. Cap render share at 10% to 20% of the total crawl budget. Isolate rendering in sandboxed environments with strict quotas: a 3-second wall-clock timeout, a 512 MB memory limit, and kill on quota exceeded.
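A minimal sketch of such a gating check, assuming a simple signature-based approach (the regexes, thresholds, and the `domain_render_rate` bookkeeping are illustrative, not an exhaustive production heuristic):

```python
import re

# Illustrative markers only; real crawlers maintain much larger fingerprint sets.
JS_FRAMEWORK_RE = re.compile(
    r'data-reactroot|__NEXT_DATA__|ng-version|data-v-app|window\.__NUXT__',
    re.IGNORECASE,
)
STRUCTURED_DATA_RE = re.compile(r'application/ld\+json|property="og:', re.IGNORECASE)

def should_render(content_type: str, raw_html: str, domain_render_rate: float) -> bool:
    """Decide whether a fetched page is queued for (expensive) JS rendering.

    domain_render_rate is the historical fraction of pages on this domain whose
    rendered content differed meaningfully from the raw HTML (tracked elsewhere).
    """
    if not content_type.lower().startswith("text/html"):
        return False  # only HTML documents are ever rendered
    if domain_render_rate > 0.5:
        return True   # domain is known to rely on client-side rendering
    if JS_FRAMEWORK_RE.search(raw_html):
        return True   # React/Angular/Vue/Next/Nuxt signatures in the raw HTML
    if STRUCTURED_DATA_RE.search(raw_html):
        return True   # structured data hints at rich content worth rendering
    return False
```

Pages that pass this check still draw from the global render budget; once the 10% to 20% cap for the crawl cycle is spent, they fall back to the raw-HTML parse.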
Microsoft Bing's evergreen rendering approach demonstrates this: Bingbot supports contemporary JS engines but carefully bounds rendering concurrency. A 32-core box can handle a few hundred renders per second under load, versus tens of thousands of HTML parses per second. Pre-fetching static assets selectively and blocking third-party trackers stabilize render times and reduce variance. The trade-off is coverage: you might miss 5% to 15% of meaningful content on JS-heavy sites if the heuristics fail.
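One way to bound render concurrency and enforce the wall-clock quota per page, sketched with asyncio (`render_page` is a hypothetical wrapper around a headless-browser call; the 512 MB memory cap would be enforced by the sandbox, for example a container limit, rather than in-process):

```python
import asyncio
from typing import Optional

MAX_CONCURRENT_RENDERS = 64   # assumed per-node bound on live browser sessions
RENDER_TIMEOUT_SEC = 3.0      # wall-clock quota per render

render_slots = asyncio.Semaphore(MAX_CONCURRENT_RENDERS)

async def render_page(url: str) -> str:
    """Placeholder for the real headless-browser render call."""
    raise NotImplementedError

async def bounded_render(url: str) -> Optional[str]:
    """Render a page under concurrency and timeout quotas.

    Returns the rendered HTML, or None if the quota was exceeded, in which
    case the crawler keeps whatever the raw HTML parse already produced.
    """
    async with render_slots:                 # cap concurrent renders per node
        try:
            return await asyncio.wait_for(render_page(url), RENDER_TIMEOUT_SEC)
        except asyncio.TimeoutError:
            # wait_for cancels the task; a real implementation would also kill
            # the underlying browser process to reclaim memory immediately.
            return None
```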
Failure mode: content bombs and parser hazards. Malicious or broken pages can serve compression bombs (1 MB gzipped expands to 10 GB), giant Document Object Models (DOMs) with millions of nodes, or pathological CSS/JS that causes exponential parse time. Enforce strict byte limits (10 MB raw HTML maximum), deflate ratio limits (under 100x expansion), and per-phase timeouts to prevent resource exhaustion.
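A sketch of the decompression guard, using zlib's streaming interface so the limits are enforced before a bomb can expand fully in memory (the specific limits mirror the figures above):

```python
import zlib

MAX_RAW_BYTES = 10 * 1024 * 1024   # 10 MB cap on decompressed HTML
MAX_DEFLATE_RATIO = 100            # reject >100x expansion (compression bombs)
CHUNK = 64 * 1024

def safe_decompress(compressed: bytes) -> bytes:
    """Decompress a gzip/zlib response body while enforcing size and ratio limits."""
    # wbits=MAX_WBITS|32 auto-detects gzip or zlib framing.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 32)
    out = bytearray()
    fed = 0
    for i in range(0, len(compressed), CHUNK):
        chunk = compressed[i:i + CHUNK]
        fed += len(chunk)
        # max_length bounds the output of this call; hitting it trips the size check.
        out += d.decompress(chunk, MAX_RAW_BYTES - len(out) + 1)
        if len(out) > MAX_RAW_BYTES:
            raise ValueError("decompressed body exceeds 10 MB limit")
        if len(out) > MAX_DEFLATE_RATIO * fed:
            raise ValueError("expansion ratio exceeds 100x, likely a compression bomb")
    out += d.flush()
    if len(out) > MAX_RAW_BYTES:
        raise ValueError("decompressed body exceeds 10 MB limit")
    return bytes(out)
```

The same pattern applies to the other phases: cap DOM node counts during parsing and abort JS execution at the wall-clock timeout.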
💡 Key Takeaways
• Rendering is 5 to 50 times more expensive: HTML fetch and parse takes under 100 ms and minimal CPU; JS rendering takes 0.5 to 3 seconds and 512 MB to 1 GB of memory per page
• Throughput difference: HTML-only crawlers achieve 10,000 pages per second per node, rendering limited to 100 to 300 pages per second per node on the same hardware
• Selective rendering heuristics: gate on MIME type, JS framework detection (React/Angular/Vue), structured data presence (JSON-LD, Open Graph), or per-domain historical necessity
• Budget constraints: cap rendering at 10% to 20% of total crawl volume to maintain overall throughput and cost efficiency at scale
• Bing's evergreen rendering shows that 32-core machines handle a few hundred renders per second under load, requiring isolation and strict resource quotas
• Protection against content bombs: enforce a 10 MB raw HTML maximum, an under-100x deflate ratio limit, 3-second wall-clock timeouts, and kill on memory quota exceeded
📌 Examples
Google's approach: HTML only for most pages; render only when structured data hints, user-agent signals, or domain reputation suggests JS reliance, keeping render share under 15%
Common Crawl is primarily HTML only: at 4 billion pages per month, rendering even 10% would require 10x more compute infrastructure and extend the crawl duration from days to weeks
Real cost at 1000 pages/s: HTML only needs 5 to 10 fetch nodes at $500 per month each ($2.5K to $5K total), while rendering 100% would need 50 to 100 render nodes at $1000 per month each ($50K to $100K total)
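The arithmetic behind this last example, as a sketch (node counts and per-node prices are the illustrative figures above, and render capacity is assumed to scale linearly with the share of pages rendered):

```python
def monthly_cost(render_share: float) -> int:
    """Rough fleet cost at 1000 pages/s for a given render share (illustrative figures)."""
    fetch_nodes, fetch_node_cost = 8, 500            # HTML path: ~5-10 nodes at ~$500/month
    full_render_nodes, render_node_cost = 75, 1000   # rendering 100%: ~50-100 nodes at ~$1000/month

    render_nodes = round(full_render_nodes * render_share)  # linear-scaling assumption
    return fetch_nodes * fetch_node_cost + render_nodes * render_node_cost

for share in (0.0, 0.15, 1.0):
    print(f"render share {share:>4.0%}: ~${monthly_cost(share):,}/month")
```

With these figures, capping the render share at 15% keeps the fleet cost within a few multiples of the HTML-only baseline, versus roughly 20x when everything is rendered.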