Log Aggregation Pipeline Architecture and Data Flow
A production-grade log aggregation pipeline follows a decoupled architecture with distinct stages, each optimized for different constraints. Understanding this flow is critical for reasoning about latency, durability, and failure modes.
At the edge, lightweight collection agents run on every host, tailing log files or intercepting stdout/stderr streams. These agents must operate within strict resource budgets: typically under 1 to 2 percent CPU and 100 to 200 MB of memory in steady state. They maintain an on-disk ring buffer sized for 5 to 30 minutes of burst traffic (5 to 30 GB) to survive temporary downstream outages. When backpressure builds, agents either drop low-priority logs or block application writes, which makes buffer sizing and sampling policies critical.
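As a rough sketch of that trade-off, the snippet below models a bounded agent-side buffer with a configurable policy: drop the lowest-priority entries first, or signal the caller to block. The class, priorities, and sizes are illustrative assumptions, not taken from any specific agent.

```python
import collections
import time


class BoundedLogBuffer:
    """Toy model of an agent's bounded (ring-buffer-like) log buffer."""

    def __init__(self, max_bytes: int, drop_low_priority: bool = True):
        self.max_bytes = max_bytes
        self.drop_low_priority = drop_low_priority
        self.used_bytes = 0
        self.entries = collections.deque()  # (priority, enqueue_time, line)

    def append(self, line: str, priority: int = 5) -> bool:
        size = len(line.encode("utf-8"))
        while self.used_bytes + size > self.max_bytes:
            if not self.drop_low_priority or not self.entries:
                return False  # caller must block or shed the write itself
            # Evict the lowest-priority entry, oldest first, to make room.
            victim = min(range(len(self.entries)),
                         key=lambda i: (self.entries[i][0], self.entries[i][1]))
            _, _, dropped = self.entries[victim]
            del self.entries[victim]
            self.used_bytes -= len(dropped.encode("utf-8"))
        self.entries.append((priority, time.time(), line))
        self.used_bytes += size
        return True


# Example: a tiny buffer keeps the high-priority error and sheds debug noise.
buf = BoundedLogBuffer(max_bytes=64)
buf.append("debug: cache miss for key user:42", priority=1)
buf.append("error: payment service timed out", priority=9)
```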
Logs flow to a durable buffer layer, typically a message bus like Kafka or write-optimized ingesters. This layer provides at-least-once delivery semantics and must be sized for peak ingest plus 2 to 6 hours of backlog capacity. Partitioning by tenant or service plus time enables parallelism downstream. For example, Uber runs multi-cluster Kafka as its telemetry backbone, handling billions of spans and logs daily with careful partitioning to avoid hot shards.
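The partitioning idea can be made concrete with a small sketch. The key layout (`tenant:service:time-bucket`) and partition count below are assumptions for illustration; a real producer such as a Kafka client would apply its own partitioner to a key like this.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count for the log topic


def partition_key(tenant: str, service: str, ts_epoch_s: float,
                  bucket_s: int = 300) -> str:
    """Key by tenant and service plus a coarse time bucket, so one busy
    service does not pin all of its traffic to a single partition forever."""
    bucket = int(ts_epoch_s // bucket_s)
    return f"{tenant}:{service}:{bucket}"


def partition_for(key: str) -> int:
    """Stable hash -> partition index (illustrative only; actual clients
    implement their own partitioners)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


# Example: all checkout logs from tenant "acme" in the same 5-minute window
# land on the same partition, preserving per-stream ordering downstream.
key = partition_key("acme", "checkout", ts_epoch_s=1_700_000_000)
print(key, "->", partition_for(key))
```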
The transform and enrichment layer sits between ingestion and storage. Here you attach metadata such as service name, environment, region, build SHA, container IDs, and, critically, correlation IDs for distributed tracing. This stage also normalizes timestamps, stitches multi-line stack traces, and can derive metrics from logs for low-latency alerting. Finally, data splits between hot storage with full or label-based indexing for recent data and cold object storage for long-term retention. Query routers fan out requests by time and tenant, merging results while enforcing per-query limits so that expensive queries have a contained blast radius.
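A minimal enrichment sketch, assuming timestamp-prefixed log lines and placeholder metadata values, shows the stitching and normalization steps in one pass:

```python
import re
from datetime import datetime, timezone

# Placeholder metadata; in practice this is discovered from the runtime
# environment (orchestrator labels, build info, etc.).
STATIC_META = {
    "service": "checkout",
    "env": "prod",
    "region": "us-east-1",
    "build_sha": "deadbeef",
    "container_id": "abc123",
}

# Lines starting with a timestamp open a new record; anything else is
# treated as a continuation line (e.g. a stack-trace frame).
TS_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


def stitch_and_enrich(raw_lines, correlation_id=None):
    records = []
    for line in raw_lines:
        if TS_PREFIX.match(line) or not records:
            try:
                ts = datetime.strptime(line[:19].replace("T", " "),
                                       "%Y-%m-%d %H:%M:%S")
                ts = ts.replace(tzinfo=timezone.utc)
            except ValueError:
                ts = datetime.now(timezone.utc)  # fallback: ingest time
            records.append({
                "ts": ts.isoformat(),          # normalized UTC timestamp
                "correlation_id": correlation_id,
                "message": line,
                **STATIC_META,
            })
        else:
            # Continuation line: fold it into the previous record.
            records[-1]["message"] += "\n" + line
    return records


# Example: the stack-trace frame is stitched onto the ERROR record.
records = stitch_and_enrich([
    "2024-05-01 12:00:00 ERROR payment failed",
    "  at PaymentService.charge(PaymentService.java:42)",
], correlation_id="req-123")
```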
💡 Key Takeaways
• Edge agents operate under strict resource budgets (under 1 to 2 percent CPU and 100 to 200 MB memory), with 5 to 30 GB on-disk ring buffers sized to ride out temporary downstream outages
• A durable buffer layer like Kafka must be sized for peak ingest plus 2 to 6 hours of backlog, providing at-least-once semantics and partitioning by tenant/service for downstream parallelism
• The transform and enrichment layer attaches service metadata, environment, region, build SHA, container IDs, and correlation IDs, and also normalizes timestamps and stitches multi-line logs
• Storage splits between a hot tier (7 days, 50 to 400 TB) with low-latency indexing and cold object storage (90 to 365 days) with rehydration on demand for cost optimization
• Query routers fan out by time and tenant, merge results, and enforce per-query limits (max bytes scanned, timeouts) so expensive queries cannot cause blast-radius incidents, as sketched below
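A query-router sketch under the assumptions above (a 7-day hot tier and a per-query bytes-scanned budget); the backend callables are stand-ins for real tier clients, not an actual API:

```python
from datetime import datetime, timedelta, timezone

MAX_BYTES_SCANNED = 50 * 1024**3     # assumed per-query budget (50 GB)
HOT_RETENTION = timedelta(days=7)    # hot-tier window from the takeaways


class QueryBudgetExceeded(Exception):
    pass


def route_query(tenant, start, end, query_hot, query_cold):
    """Split [start, end) at the hot/cold boundary, fan out to each tier,
    enforce a bytes-scanned budget, and merge results by timestamp.

    query_hot and query_cold are placeholder backends; each is expected to
    return (rows, bytes_scanned) for the given tenant and time range.
    """
    boundary = datetime.now(timezone.utc) - HOT_RETENTION
    plans = []
    if end > boundary:
        plans.append((query_hot, max(start, boundary), end))
    if start < boundary:
        plans.append((query_cold, start, min(end, boundary)))

    scanned, results = 0, []
    for backend, s, e in plans:
        rows, bytes_scanned = backend(tenant, s, e)
        scanned += bytes_scanned
        if scanned > MAX_BYTES_SCANNED:
            raise QueryBudgetExceeded(f"scanned {scanned} bytes, over budget")
        results.extend(rows)
    return sorted(results, key=lambda r: r["ts"])
```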
📌 Examples
Uber runs multi-cluster Kafka as its telemetry backbone for over 10 billion spans per day plus application logs, using Flink/Spark for enrichment and routing to tiered storage with sub-30-second search availability
Netflix's Keystone pipeline carries trillions of events daily with seconds-level latency, decoupling producers via a durable bus, enriching in stream, and fanning out to hot analytics and long-term storage