Data Pipelines & Orchestration • Pipeline Architecture Patterns
Pipeline Stage Design and Contracts
The Stage Contract:
Every pipeline stage is a black box with three critical specifications. First, the input contract: what data format, schema, and assumptions does this stage require? Second, the output contract: what format, guarantees, and quality does it produce? Third, the performance contract: what throughput and latency targets must it meet?
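One way to make these three contracts explicit is to encode them as data that tooling can check. The sketch below is illustrative, not from the original; the class and field names (`StageContract`, `validate_input`, etc.) are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageContract:
    """Hypothetical spec for one pipeline stage's three contracts."""
    required_fields: tuple[str, ...]   # input contract: schema the stage needs
    max_event_age_hours: int           # input contract: freshness assumption
    added_fields: tuple[str, ...]      # output contract: fields it guarantees to add
    target_throughput_eps: int         # performance contract: events/sec per instance
    p99_latency_ms: int                # performance contract: tail latency budget


# The enrichment stage described below, expressed as a contract object:
enrichment = StageContract(
    required_fields=("user_id", "timestamp", "schema_version"),
    max_event_age_hours=24,
    added_fields=("geo_enriched",),
    target_throughput_eps=50_000,
    p99_latency_ms=100,
)


def validate_input(event: dict, contract: StageContract) -> bool:
    """Input-contract check: reject events missing any required field."""
    return all(field in event for field in contract.required_fields)
```

Making the contract a first-class object lets a stage reject bad input at its boundary instead of failing deep inside its transform logic.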
Consider a real enrichment stage. Input contract: "Events must have user_id, timestamp, and schema_version fields. Timestamps must be within 24 hours of current time." Output contract: "Adds geo_enriched field with country, region, and city. Events are deduplicated by event_id within a 24 hour window." Performance contract: "Processes 50k events per second per instance with p99 latency under 100ms."
Stateless vs Stateful Stages:
Most stages should be stateless: they transform each record independently without remembering previous records. This makes them trivial to parallelize. Spin up 20 instances, and each processes 1/20th of the traffic. Stateless stages also recover easily from failures: just restart and continue.
But some stages need state. Aggregation stages compute per user counts or rolling windows. Join stages need to hold one side of the join in memory or storage. For stateful stages, you must externalize the state. Instead of keeping counters in memory, write them to a key value store or state backend. This enables horizontal scaling: partition state by key, and each worker handles a subset of keys. It also enables recovery: if a worker crashes, another can load its state partition and continue.
Parallelism Through Partitioning:
To scale a stage from 10k to 200k events per second, you cannot just make one instance faster. You partition the data stream and run multiple instances in parallel. The most common approach is partitioning by key. Hash the user_id, and send all events for user 123 to worker 3. This preserves per user ordering, which is often necessary for correctness.
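Key-based routing can be sketched in a few lines. This is a minimal illustration (the function name is hypothetical); note that Python's built-in `hash()` is salted per process, so a stable digest is used instead:

```python
import hashlib

NUM_WORKERS = 20


def partition_for(user_id: str, num_workers: int = NUM_WORKERS) -> int:
    """Stable key-based partitioning: the same user_id always maps to the
    same worker, which preserves per-user event ordering."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


# Every event for one user is routed to one worker:
worker = partition_for("user-123")
```

A real system would typically delegate this to the message bus (e.g., keyed Kafka topic partitions), but the routing rule is the same.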
The math matters. If one stage instance handles 10k events per second and you need 200k per second throughput, you need at least 20 workers. In practice, add headroom: use 24 to 30 workers to handle traffic spikes and leave room for rolling deployments without dropping below capacity.
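The sizing arithmetic is worth pinning down, since under-provisioning shows up as sustained lag. A small sketch, assuming a hypothetical `workers_needed` helper and a 30% headroom default:

```python
import math


def workers_needed(target_eps: int, per_worker_eps: int,
                   headroom_pct: int = 30) -> int:
    """Minimum workers to hit target throughput, plus percentage headroom
    for traffic spikes and rolling deployments."""
    base = math.ceil(target_eps / per_worker_eps)          # bare minimum
    return math.ceil(base * (100 + headroom_pct) / 100)    # with headroom


# 200k events/sec at 10k per worker: 20 workers minimum, 26 with 30% headroom
```

With 30% headroom this lands at 26 workers, inside the 24-to-30 range above; the right percentage depends on how spiky your traffic actually is.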
⚠️ Common Pitfall: Partitioning by user_id works great until you have power users. One user generates 100 times more events than average. That worker becomes a hot spot, causing p99 latency spikes. The solution is secondary partitioning: hash on user_id plus event_type, or use consistent hashing with virtual nodes to spread hot keys.
Transport Between Stages:
For streaming pipelines, stages communicate through message queues or distributed logs like Kafka. The queue provides buffering: if Stage 3 slows down temporarily, the queue between Stage 2 and 3 absorbs the backlog, preventing upstream disruption. The queue also enables replay: you can reprocess historical data by resetting the consumer offset.
For batch pipelines, stages write output files to object storage or a data lake, and the next stage reads those files. This is simpler but higher latency. A batch stage might take 5 minutes to run and write output, then the next stage waits for all output files before starting. End to end latency for a 5 stage batch pipeline could be 30 to 60 minutes, compared to seconds for streaming.
💡 Key Takeaways
✓ Each stage defines three contracts: input requirements, output guarantees, and performance targets like 50k events/sec at p99 100ms latency
✓ Stateless stages transform records independently and scale trivially; stateful stages externalize state to key value stores for horizontal scaling
✓ Partitioning by key enables parallelism: for 200k events/sec at 10k per worker, deploy at least 20 workers with headroom for spikes
✓ Message queues between stages provide buffering and replay capability; batch stages use file storage with higher latency but simpler semantics
📌 Examples
1. An enrichment stage contract: Input requires user_id and timestamp within 24h. Output adds geo_enriched field and deduplicates by event_id. Performance: 50k events/sec per instance, p99 under 100ms.
2. A stateful aggregation stage partitions by user_id across 30 workers. Each worker maintains counters in Redis for its partition (user_id % 30). If a worker crashes, another loads state for that partition and resumes counting.
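The state-externalization pattern in the second example can be sketched as follows. A plain dict stands in for Redis here, and the class name is illustrative; the point is that counters live in a shared store keyed by partition, so a replacement worker can pick up where a crashed one left off:

```python
class ExternalizedCounter:
    """Sketch of externalized per-user counters, partitioned by user_id % N.

    The `store` dict is a stand-in for a real key-value backend like Redis;
    because state lives there rather than in worker memory, any worker
    attached to the same store can resume a partition after a crash.
    """

    def __init__(self, store: dict, num_partitions: int = 30):
        self.store = store
        self.num_partitions = num_partitions

    def partition(self, user_id: int) -> int:
        return user_id % self.num_partitions

    def increment(self, user_id: int) -> int:
        key = f"count:{self.partition(user_id)}:{user_id}"
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]


shared_store = {}                              # stands in for Redis
worker_a = ExternalizedCounter(shared_store)
worker_a.increment(123)                        # count = 1
worker_a.increment(123)                        # count = 2
# worker_a "crashes"; a replacement attaches to the same store:
worker_b = ExternalizedCounter(shared_store)
worker_b.increment(123)                        # count = 3, state survived
```

In production the dict would be replaced by atomic store operations (e.g., Redis INCR) to stay correct under concurrent writers.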