Spark in Production: Batch ETL and Streaming at Scale
The Production Architecture: In real-world deployments, Spark sits between storage systems like S3 or HDFS and serving systems like data warehouses or key-value stores. Raw data flows in from sources like Kafka, lands in object storage, is transformed by Spark, and lands back in curated datasets that downstream tools can query.
At Netflix, raw events such as play starts, errors, and device telemetry arrive at millions of events per second. These land in S3 as compressed files. Spark clusters on EMR or Databricks read tens to hundreds of terabytes per day, perform ETL, join with dimension tables, compute aggregates, and write results back to S3 in columnar formats like Parquet. Downstream tools like Presto query these curated tables with p50 latencies in the hundreds of milliseconds.
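To make the shape of such a job concrete, here is a minimal PySpark sketch of a daily ETL pass; the bucket paths, column names, and aggregates are hypothetical placeholders for illustration, not Netflix's actual pipeline:

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical paths and column names; the real pipeline's layout will differ.
spark = SparkSession.builder.appName("daily-playback-etl").getOrCreate()

# Read one day's raw events from object storage (columnar input assumed).
events = spark.read.parquet("s3://raw-events/playback/dt=2024-06-01/")

# Join with a small dimension table; broadcast it to avoid a shuffle.
devices = spark.read.parquet("s3://dims/devices/")
enriched = events.join(F.broadcast(devices), on="device_id", how="left")

# Aggregate per title and country, then write curated Parquet partitioned by date.
daily_agg = (
    enriched
    .groupBy("title_id", "country")
    .agg(
        F.count("*").alias("play_starts"),
        F.sum(F.when(F.col("event_type") == "error", 1).otherwise(0)).alias("errors"),
    )
    .withColumn("dt", F.lit("2024-06-01"))
)

daily_agg.write.mode("overwrite").partitionBy("dt").parquet("s3://curated/playback_daily/")
```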
Batch ETL Sizing: A typical daily batch pipeline runs with 100 to 500 executors, each with 8 to 16 cores and 64 to 128 GB of memory. Properly tuned, these clusters scan 10 TB of columnar data and perform multi-way joins in a few minutes, with p50 stage times in seconds and end-to-end job times under 10 minutes. To process a 20 TB daily partition within 30 minutes, engineers provision enough aggregate CPU and memory, tune shuffle partitions so each task handles 100 to 300 MB, and enable Adaptive Query Execution (AQE) for dynamic optimization.
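A sketch of what that sizing looks like as Spark configuration; the executor counts and partition numbers are illustrative values within the ranges above, while the keys themselves are standard Spark settings:

```python
from pyspark.sql import SparkSession

# Illustrative sizing for a ~10-20 TB daily scan; real values depend on the
# cluster manager, input volume, and SLA.
spark = (
    SparkSession.builder
    .appName("daily-batch-etl")
    # ~200 executors x 8 cores x 64 GB falls in the range described above.
    .config("spark.executor.instances", "200")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "64g")
    # Aim for ~100-300 MB per shuffle task:
    # partitions ≈ shuffle bytes / target size (e.g. ~1 TB / 256 MB ≈ 4000).
    .config("spark.sql.shuffle.partitions", "4000")
    # Adaptive Query Execution: coalesce small partitions and re-optimize joins at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")
    .getOrCreate()
)
```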
Structured Streaming: For real-time use cases, Spark Structured Streaming processes data in micro-batches every 500 milliseconds to 5 seconds. At Uber, Kafka feeds events at hundreds of thousands to a few million events per second for use cases like fraud detection or surge pricing. Spark maintains state in memory, checkpoints periodically for fault tolerance, and with proper backpressure configuration keeps end-to-end latency from event ingestion to derived signal under 2 to 5 seconds at p50 and 10 to 20 seconds at p99.
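A minimal Structured Streaming sketch along these lines; the Kafka topic, JSON fields, window sizes, and offset cap are assumptions for illustration, not Uber's actual job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ride-events-stream").getOrCreate()

# Kafka source with a simple backpressure cap on records per micro-batch.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "ride-events")
    .option("maxOffsetsPerTrigger", 500000)
    .load()
)

# Parse the JSON payload (assuming ISO-8601 timestamps) and aggregate per city
# over 1-minute event-time windows.
events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.city").alias("city"),
    F.get_json_object(F.col("value").cast("string"), "$.ts").cast("timestamp").alias("ts"),
)

counts = (
    events
    .withWatermark("ts", "2 minutes")               # bound state for late data
    .groupBy(F.window("ts", "1 minute"), "city")
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")                               # replace with a real sink in production
    .option("checkpointLocation", "s3://checkpoints/ride-events/")
    .trigger(processingTime="2 seconds")             # micro-batch every 2 seconds
    .start()
)
query.awaitTermination()
```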
The same clusters often support both streaming and batch workloads by scheduling different jobs in different time windows. Streaming jobs run continuously with smaller resource allocations, while batch jobs run on schedules (hourly, daily) with larger bursts of compute.
The Reality Check: Production systems fail or degrade not from core engine bugs but from resource imbalances, data characteristics, or external limits. A single key with 10x more records than others creates a straggler task. Large shuffles stress disk and network. External dependencies like Hive metastores or S3 rate limits often become the bottleneck before CPU saturates.
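One quick way to spot a dominant key before it produces a straggler, plus the AQE skew-join settings that let Spark split oversized shuffle partitions automatically; the table path and join column are hypothetical, the configuration keys are standard:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Inspect the heaviest keys in the join column; a single key with 10x the
# records of the others will stand out immediately.
events = spark.read.parquet("s3://raw-events/playback/dt=2024-06-01/")
(events.groupBy("user_id")
       .count()
       .orderBy(F.desc("count"))
       .show(10))

# AQE can split skewed shuffle partitions during sort-merge joins at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```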
Production Cluster Scale: 200 executors · 10 TB daily scan · < 10 min job time
❗ Remember: A streaming job's micro-batch processing time must stay below its trigger interval. A 1-second trigger on a cluster that needs 1.5 seconds per batch causes unbounded lag and state growth.
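As a rough guard, the streaming query's own progress metrics can be compared against the trigger interval; this sketch assumes the `query` handle and 2-second trigger from the streaming example above:

```python
# lastProgress reports per-batch timings for the most recent micro-batch.
progress = query.lastProgress
if progress is not None:
    batch_ms = progress["durationMs"]["triggerExecution"]  # time to process the last batch
    trigger_ms = 2000                                      # configured trigger interval
    if batch_ms > trigger_ms:
        print(f"Falling behind: batch took {batch_ms} ms vs {trigger_ms} ms trigger")
```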
💡 Key Takeaways
✓ Production Spark clusters with 100 to 500 executors scan 10 to 20 TB daily for batch ETL, with p50 stage times in seconds and job times from under 10 minutes to about 30 minutes when properly tuned
✓ Structured Streaming uses micro-batches every 500 milliseconds to 5 seconds, processing hundreds of thousands to millions of events per second with end-to-end latency of 2 to 5 seconds at p50 for real-time use cases
✓ Engineers size clusters to match peak volumes within SLAs, tune shuffle partitions to 100 to 300 MB per task, and enable AQE to dynamically optimize joins and coalesce partitions
✓ Real bottlenecks come from data skew creating straggler tasks, large shuffles stressing disk and network, and external system limits like metastore latency or storage rate limits
📌 Examples
1. Netflix processes tens to hundreds of terabytes per day from S3, with Spark performing ETL and multi-way joins on raw events and writing curated Parquet datasets that Presto queries with p50 latencies under 500 milliseconds
2. Uber runs Structured Streaming for fraud detection with 500-millisecond to 5-second micro-batches, maintaining in-memory state and checkpointing to achieve 2 to 5 second end-to-end latency from Kafka to output