Robustness and Failure Modes: DNS Caching, Robots.txt, and Backoff Strategies
Production crawlers operate in a hostile environment filled with pathological edge cases. Domain Name System (DNS) pathologies are common: high-latency resolvers (500 milliseconds (ms) to 2 seconds), cyclic Canonical Name (CNAME) records, Not an Existing Domain (NXDOMAIN) storms from typo domains, zero Time To Live (TTL) records forcing repeated lookups, and DNS-based blocking. Aggressive DNS caching that respects TTL (typically 300 to 3600 seconds) and negative caching (cache NXDOMAIN for 60 to 300 seconds) are mandatory. Isolate DNS failures per domain to prevent cascading: set per-domain DNS queries-per-second (QPS) limits and fall back to cached stale records on resolver timeouts.
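A minimal caching sketch along these lines, assuming a pluggable resolve_uncached callable that returns (ips, ttl), raises LookupError on NXDOMAIN, and raises TimeoutError when the resolver does not answer; per-domain QPS limiting is omitted and all names are illustrative:

```python
import time

POSITIVE_TTL_CAP = 3600    # never trust a TTL longer than an hour
POSITIVE_TTL_FLOOR = 60    # floor for zero-TTL records so they still get cached
NEGATIVE_TTL = 120         # cache NXDOMAIN within the 60-300 second range

class DnsCache:
    def __init__(self, resolve_uncached):
        # resolve_uncached: hostname -> (ips, ttl); raises LookupError on NXDOMAIN,
        # TimeoutError when the upstream resolver does not answer in time.
        self._resolve = resolve_uncached
        self._cache = {}  # hostname -> (ips or None for NXDOMAIN, expires_at)

    def resolve(self, hostname):
        now = time.time()
        entry = self._cache.get(hostname)
        if entry and entry[1] > now:
            ips, _ = entry
            if ips is None:
                raise LookupError(f"NXDOMAIN (negatively cached): {hostname}")
            return ips
        try:
            ips, ttl = self._resolve(hostname)
        except LookupError:
            # Negative caching: remember the NXDOMAIN so typo-domain storms
            # do not hammer the resolver with repeated lookups.
            self._cache[hostname] = (None, now + NEGATIVE_TTL)
            raise
        except TimeoutError:
            # Resolver timeout: fall back to a stale cached record if one exists.
            if entry and entry[0] is not None:
                return entry[0]
            raise
        ttl = min(max(ttl, POSITIVE_TTL_FLOOR), POSITIVE_TTL_CAP)
        self._cache[hostname] = (ips, now + ttl)
        return ips
```

The TTL floor also defuses the zero-TTL pathology described in the examples below: a record that would otherwise force a lookup on every fetch is held for at least 60 seconds.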
Robots.txt parsing is legally and operationally critical but fraught with ambiguity. Cache robots.txt per host with a TTL and revalidate on HTTP 4xx or 5xx responses. Parse Disallow and Allow precedence (the longest matching prefix wins) and Crawl-delay hints. On ambiguous or malformed robots.txt, fall back to conservative defaults: assume Disallow for everything and set the crawl delay to 1 second. Misparsing robots.txt can lead to legal exposure or missed content; test against reference implementations.
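A sketch of the longest-prefix-match precedence, deliberately simplified (single user-agent group, no wildcard or $ patterns, revalidation and caching omitted; names are illustrative):

```python
def parse_rules(robots_txt, agent="examplebot"):
    # Collect (is_allow, path_prefix) rules that apply to our user agent.
    rules, applies = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = value == "*" or agent.lower() in value.lower()
        elif applies and field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules

def is_allowed(rules, path):
    # Longest prefix match wins; on a tie, the less restrictive Allow rule wins.
    best_len, allowed = -1, True  # default: allow when no rule matches
    for rule_allows, prefix in rules:
        if not path.startswith(prefix):
            continue
        if len(prefix) > best_len or (len(prefix) == best_len and rule_allows):
            best_len, allowed = len(prefix), rule_allows
    return allowed
```

With "Disallow: /private" and "Allow: /private/public", a path like /private/public/page is allowed because the Allow prefix is longer, matching the ambiguity example at the end of this section; a parse failure should trigger the conservative disallow-everything default instead.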
Backoff and retry strategies prevent host overload and bans. Implement multi-level throttles: per-URL timeouts and size limits (10 megabytes (MB), 10 seconds), per-host token buckets (1 to 2 concurrent), per-IP and per-subnet aggregate tokens (to avoid overloading multi-tenant hosts), and a global cap on concurrent fetches (limit total outstanding requests). On 429 Too Many Requests or 503 Service Unavailable with a Retry-After header, set the next allowed fetch accordingly; without hints, use exponential backoff with jitter (double the delay up to a 60 to 300 second cap). On Transport Layer Security (TLS) or DNS errors, apply slower backoff (quadruple the delay) and reduce concurrency.
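A hedged sketch of that retry policy as a single helper; the function name, constants, and parameters are illustrative, and a real crawler would track per-host state around it:

```python
import random

BASE_DELAY = 1.0    # seconds before the first retry
MAX_DELAY = 300.0   # cap within the 60-300 second range

def next_delay(prev_delay, status=None, retry_after=None, transport_error=False):
    # Honor an explicit Retry-After hint from a 429/503 response.
    if retry_after is not None:
        return float(retry_after)
    if transport_error:
        factor = 4.0   # TLS or DNS errors: back off more aggressively
    elif status in (429, 503):
        factor = 2.0   # overload signals without a hint: exponential backoff
    else:
        return BASE_DELAY
    delay = min(max(prev_delay, BASE_DELAY) * factor, MAX_DELAY)
    # Full jitter so many workers do not retry in lockstep.
    return random.uniform(delay / 2, delay)
```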
Cascading failure mode: one slow host can exhaust connection pools and worker threads, causing backpressure into the frontier and eventually head-of-line blocking. Mitigate with per-host timeouts (a strict 5 to 10 second read timeout), circuit breakers (auto-graylist hosts with over a 50% error rate or p95 latency over 10 seconds for 5 to 15 minutes), and bulkheads (isolate host groups so one bad actor cannot monopolize global resources).
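One possible per-host breaker sketch, assuming the scheduler calls record() after each fetch and checks allow_fetch() before dispatching; the window size, thresholds, and names are assumptions, and a production breaker would also send a single half-open probe before fully resuming, as noted in the takeaways below:

```python
import collections
import time

WINDOW = 50                    # recent fetches per host considered
ERROR_RATE_THRESHOLD = 0.5     # graylist above a 50% error rate
P95_LATENCY_THRESHOLD = 10.0   # seconds
GRAYLIST_SECONDS = 600         # within the 5-15 minute range

class HostCircuitBreaker:
    def __init__(self):
        self.samples = collections.deque(maxlen=WINDOW)  # (ok, latency_seconds)
        self.graylisted_until = 0.0

    def record(self, ok, latency):
        self.samples.append((ok, latency))
        if len(self.samples) < WINDOW:
            return  # not enough data yet to judge the host
        error_rate = sum(1 for ok_, _ in self.samples if not ok_) / len(self.samples)
        latencies = sorted(lat for _, lat in self.samples)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        if error_rate > ERROR_RATE_THRESHOLD or p95 > P95_LATENCY_THRESHOLD:
            self.graylisted_until = time.time() + GRAYLIST_SECONDS

    def allow_fetch(self):
        # While graylisted, the scheduler skips this host (or sends a single probe).
        return time.time() >= self.graylisted_until
```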
💡 Key Takeaways
•DNS caching critical: cache positive responses per TTL (300 to 3600 seconds typical), negatively cache NXDOMAIN for 60 to 300 seconds, isolate failures per domain with QPS limits
•Robots.txt parsing: cache per host with a TTL, resolve Disallow/Allow precedence by longest prefix match, fall back to conservative defaults (Disallow all, 1 second crawl delay) on ambiguous or malformed input
•Multi-level throttles: per URL (10 MB, 10 second timeout), per host (1 to 2 concurrent tokens), per IP/subnet (aggregate tokens), global (max outstanding fetches)
•Exponential backoff on 429/503: use the Retry-After header if present, otherwise double the delay with jitter and cap at 60 to 300 seconds; quadruple the delay on TLS/DNS errors
•Circuit breaker pattern: auto-graylist hosts with over a 50% error rate or p95 latency over 10 seconds for 5 to 15 minutes, probe with a single request before resuming
•Cascading failure risk: one slow host exhausts connection pools and causes head-of-line blocking; mitigate with strict per-host read timeouts (5 to 10 seconds) and bulkheads isolating host groups
📌 Examples
Google's dynamic crawl budgets: reduce per-host concurrency on rising latency (if p95 jumps from 300 ms to 2 seconds, drop from 2 concurrent fetches to 1) and apply exponential backoff on 429 responses
DNS pathology example: a misconfigured authoritative nameserver returns a zero TTL for all records, causing the crawler to re-resolve on every fetch; enforcing a minimum cache TTL of 60 seconds on positive answers (alongside negative caching for NXDOMAIN) prevents the query storm
Robots.txt ambiguity: Disallow: /private and Allow: /private/public create conflicting rules; the longest prefix match (Allow: /private/public) wins for paths under /private/public, but a conservative parser might Disallow both; test against a reference corpus