Idempotency at Production Scale
The Architecture:
At companies like Uber and Netflix, idempotency is designed into every layer of the pipeline, from ingestion to final materialization. The pattern that works at petabyte scale separates concerns: raw ingestion is append-only and cheap, while derived tables enforce strict idempotency.
Consider an ads analytics platform processing 200,000 events per second at peak. Mobile and web clients send impression and click events into Kafka. The raw topic is append-only and accepts duplicates to maximize ingestion throughput. Write latency stays under 10ms at p99 because there is no dedupe or coordination overhead.
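To make the ingestion side concrete, here is a minimal producer sketch, assuming the kafka-python client; the topic name, broker address, and event fields are illustrative, not taken from any real system:
<code>
# Raw append-only ingestion sketch (kafka-python client).
# No dedupe, no coordination: the raw topic accepts duplicates,
# which is what keeps p99 write latency low.
import json
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Key by event_id so retries of the same event land in the same partition.
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_impression(campaign_id: str, user_id: str) -> None:
    """Append one impression event; duplicates are dealt with downstream."""
    event = {
        "event_id": str(uuid.uuid4()),  # client-generated, used for dedupe later
        "type": "impression",
        "campaign_id": campaign_id,
        "user_id": user_id,
        "ts": time.time(),
    }
    producer.send("ads.events.raw", key=event["event_id"], value=event)
</code>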
Layer by Layer:
The streaming layer computes real-time metrics with 10-second end-to-end latency at p99. It uses windowed deduplication: it maintains a set of seen event_id values for the last 2 hours in memory, sharded by key hash across worker instances. For 200,000 events per second over 2 hours, that's 1.44 billion event IDs to track. Using a compact representation, this fits in 10 to 20 GB of memory distributed across the cluster.
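Here is a minimal single-process sketch of that windowed dedupe, assuming a plain in-memory set with time-based eviction; production systems shard this state by key hash across workers (for example, in Flink keyed state):
<code>
# Windowed dedupe sketch: in-memory set of seen event IDs, evicting
# entries older than the window. Single-process for clarity.
import time
from collections import OrderedDict
from typing import Optional

class WindowedDeduper:
    """Tracks event IDs seen within a sliding time window."""

    def __init__(self, window_seconds: float = 2 * 3600):
        self.window = window_seconds
        # event_id -> first-seen timestamp, kept in arrival order
        self.seen: "OrderedDict[str, float]" = OrderedDict()

    def is_duplicate(self, event_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self._evict(now)
        if event_id in self.seen:
            return True          # already processed within the window: drop
        self.seen[event_id] = now
        return False             # first sighting: process the event

    def _evict(self, now: float) -> None:
        # Arrival order means the oldest entry is always at the front.
        while self.seen:
            oldest_ts = next(iter(self.seen.values()))
            if now - oldest_ts > self.window:
                self.seen.popitem(last=False)
            else:
                break

deduper = WindowedDeduper()
assert deduper.is_duplicate("evt-123") is False  # first delivery
assert deduper.is_duplicate("evt-123") is True   # retried delivery: dropped
</code>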
The batch layer recomputes daily and hourly aggregates. Jobs are designed to be fully rerunnable. A daily job for campaign metrics computes from the raw append-only log and writes results with a composite key of campaign_id plus date. If yesterday's job had a bug, you fix the code and rerun it. The upsert overwrites the bad data.
[Chart: Processing Cost Comparison: raw append ~10ms per event vs. dedupe + upsert ~50ms]
The data warehouse receives cleaned, deduped events via a batch loader that runs hourly. Each load is idempotent because it either replaces entire partitions (for example, DROP PARTITION for hour X, then INSERT new data for hour X) or uses MERGE statements keyed by event_id. Backfills work the same way: to reprocess March 2024, you rerun the pipeline for those dates and let it overwrite the existing partitions.
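To make the rerun-safety of the batch layer concrete, here is a small sketch of an idempotent daily job, assuming SQLite with upsert support (3.24+) and a composite primary key of (campaign_id, date); table and field names are hypothetical:
<code>
# Rerunnable daily-aggregate sketch: recompute from the raw log, then
# upsert keyed by (campaign_id, date). Reruns overwrite, never duplicate.
import sqlite3
from collections import Counter

def run_daily_job(conn: sqlite3.Connection, raw_events: list, date: str) -> None:
    """Recompute one day's campaign metrics and upsert them."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS campaign_daily (
            campaign_id TEXT,
            date        TEXT,
            impressions INTEGER,
            PRIMARY KEY (campaign_id, date)
        )""")
    # Recompute from the raw append-only log (dedupe by event_id would
    # happen before this step).
    counts = Counter(e["campaign_id"] for e in raw_events if e["date"] == date)
    conn.executemany(
        """INSERT INTO campaign_daily (campaign_id, date, impressions)
           VALUES (?, ?, ?)
           ON CONFLICT (campaign_id, date) DO UPDATE SET
               impressions = excluded.impressions""",
        [(cid, date, n) for cid, n in counts.items()],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
raw = [{"campaign_id": "c1", "date": "2024-03-15"},
       {"campaign_id": "c1", "date": "2024-03-15"}]
run_daily_job(conn, raw, "2024-03-15")
run_daily_job(conn, raw, "2024-03-15")  # safe rerun: converges to same state
assert conn.execute("SELECT impressions FROM campaign_daily").fetchone() == (2,)
</code>
Because the job recomputes from the raw log and the write is a keyed upsert, running it twice, or rerunning a fixed version over old dates, converges to the same table state.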
Feature Store Integration:
ML feature stores have tight freshness requirements, often less than 5 minutes for 95% of features. These systems use upserts keyed by user_id plus feature_name plus timestamp. When a pipeline recomputes features for the last hour due to late data, it simply upserts the new values. The inference service always reads the latest value per key.
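A toy sketch of that read-latest-per-key contract, assuming an in-memory store keyed by (user_id, feature_name); real feature stores persist this in a low-latency database, but the idempotency argument is the same:
<code>
# Feature-store upsert sketch: latest value per (user_id, feature_name),
# chosen by timestamp so replays and late data cannot clobber fresher values.
from typing import Any, Dict, Tuple

class FeatureStore:
    def __init__(self) -> None:
        # (user_id, feature_name) -> (timestamp, value)
        self._latest: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def upsert(self, user_id: str, feature_name: str, ts: float, value: Any) -> None:
        key = (user_id, feature_name)
        current = self._latest.get(key)
        # Keep the newest timestamp: the write is idempotent and order-safe.
        if current is None or ts >= current[0]:
            self._latest[key] = (ts, value)

    def read(self, user_id: str, feature_name: str) -> Any:
        """The inference service always sees the latest value per key."""
        return self._latest[(user_id, feature_name)][1]

store = FeatureStore()
store.upsert("u1", "clicks_1h", ts=100.0, value=7)
store.upsert("u1", "clicks_1h", ts=100.0, value=7)  # replayed write: no change
store.upsert("u1", "clicks_1h", ts=90.0, value=5)   # late, stale write: ignored
assert store.read("u1", "clicks_1h") == 7
</code>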
✓ In Practice: Netflix processes petabytes of viewing data monthly. They can replay months of logs to add new features or fix computation bugs, confident that idempotent materializations won't corrupt downstream tables.
💡 Key Takeaways
✓ Separate raw append-only ingestion (cheap, fast) from idempotent derived materializations (correct, replayable)
✓ Streaming dedupe at 200,000 events per second requires ~10 to 20 GB of memory for a 2-hour window of event IDs
✓ Batch jobs designed for full reruns using partition replacement or upserts enable safe backfills of months of data
✓ Feature stores use composite keys like <code>user_id</code> plus <code>feature_name</code> to maintain freshness under 5 minutes while supporting replays
✓ At petabyte scale, idempotency is what enables operational flexibility: resharding, migrations, and bug fixes via reprocessing
📌 Examples
1. Uber's event stream: Raw trip events append to Kafka at 100k+ events/sec. Derived tables like driver_daily_earnings use <code>driver_id</code> plus <code>date</code> keys for idempotent daily batch jobs.
2. Warehouse partition replacement: <code>ALTER TABLE events DROP PARTITION (date='2024-03-15'); INSERT INTO events PARTITION (date='2024-03-15') SELECT * FROM reprocessed_events WHERE date='2024-03-15';</code>
3. Streaming windowed dedupe: Flink state stores seen <code>event_id</code> values per 2-hour window, evicting older entries. Memory usage is bounded at ~15 GB for a 200k events/sec stream.