Trade-offs: Append-Only vs. Idempotent Upserts
The Fundamental Trade-off:
Every data pipeline makes a choice between write simplicity and read correctness: you can make writes fast and dumb, or smart and safe. Either choice has a cost.
Append-Only (Fast Writes, Complex Reads):
With pure append, every write is an INSERT. No key lookups, no conflict checks, no coordination. At 100,000 events per second you can sustain sub-10ms p99 write latency against a distributed log like Kafka. This is why raw ingestion is almost always append-only.
The cost appears downstream. Consumers must deduplicate and resolve conflicts. Queries need GROUP BY, DISTINCT, or window functions to pick the latest version of each record. Storage grows linearly with retries: a 5% duplicate rate means 5% wasted storage. For petabyte-scale systems, that's expensive.
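To make the read-side cost concrete, here is a sketch of the deduplication query consumers typically end up writing against an append-only table. The events table and its columns (event_id, entity_id, payload, event_time) are hypothetical stand-ins, not from any specific pipeline.

    -- Keep only the latest version of each event from an append-only table.
    -- Hypothetical schema: events(event_id, entity_id, payload, event_time)
    SELECT event_id, entity_id, payload, event_time
    FROM (
        SELECT
            event_id, entity_id, payload, event_time,
            ROW_NUMBER() OVER (
                PARTITION BY event_id        -- duplicates from retries share an event_id
                ORDER BY event_time DESC     -- the most recent write wins
            ) AS rn
        FROM events
    ) deduped
    WHERE rn = 1;

Every consumer that forgets this wrapper silently double-counts, which is exactly the complexity the upsert approach pushes back to write time.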
This pattern works well for immutable event logs where readers are sophisticated data engineers. It fails for user-facing analytics where business analysts run ad hoc SQL; they shouldn't need to know about deduplication logic.
Idempotent Upsert (Safe Writes, Simple Reads):
With upserts, each write checks for an existing key and either inserts or updates. At 100,000 operations per second, this requires careful indexing and sharding. Write latency increases to 20 to 50ms p99 because of the lookup cost. Hot keys can create bottlenecks: if all events hash to the same partition, throughput drops significantly.
The benefit is clean, simple downstream consumption. Queries just SELECT without worrying about duplicates. Storage is efficient because there's one row per business entity. This is essential for customer-facing tables, financial reporting, and ML feature stores where correctness is non-negotiable.
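As a sketch, the upsert path in most warehouses is a MERGE keyed on the business identifier. The daily_revenue and staging_revenue table names below are assumptions, chosen to match the daily-revenue example later in this section; exact MERGE syntax varies by engine.

    -- Idempotent upsert: rerunning the job with the same staging data leaves
    -- exactly one row per (merchant_id, revenue_date), with the latest values.
    MERGE INTO daily_revenue AS target
    USING staging_revenue AS source
        ON  target.merchant_id  = source.merchant_id
        AND target.revenue_date = source.revenue_date
    WHEN MATCHED THEN
        UPDATE SET revenue = source.revenue
    WHEN NOT MATCHED THEN
        INSERT (merchant_id, revenue_date, revenue)
        VALUES (source.merchant_id, source.revenue_date, source.revenue);

The key lookup implied by the ON clause is where the 20 to 50ms of extra write latency comes from.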
When to Choose What:
Raw logs and event streams: append-only. Maximizes ingestion throughput, accepts duplicates. Examples: clickstreams, application logs, IoT sensor data at millions of events per second.
Derived aggregates and dimensions: idempotent upserts. Enables safe recomputation and backfills. Examples: user profiles, daily revenue by merchant, ML features, any table used by BI or operational dashboards.
Financial transactions: idempotent with extra care. Use database transactions or distributed sagas to ensure upserts are atomic with related state changes. Cannot tolerate even temporary duplicates.
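As one sketch of that "extra care" (PostgreSQL syntax; the payments and accounts tables and the idempotency_key column are hypothetical), the payment insert and the balance change run in a single transaction, and a retried request that hits the unique key changes nothing:

    -- Assumes a unique constraint on payments.idempotency_key.
    BEGIN;

    WITH inserted AS (
        -- Record the payment once; a retried request conflicts and returns no rows.
        INSERT INTO payments (idempotency_key, account_id, amount)
        VALUES ('req-20240101-abc', 42, 100.00)
        ON CONFLICT (idempotency_key) DO NOTHING
        RETURNING account_id, amount
    )
    -- Debit the balance only if this run actually recorded a new payment.
    UPDATE accounts
    SET balance = accounts.balance - inserted.amount
    FROM inserted
    WHERE accounts.id = inserted.account_id;

    COMMIT;

Because the two statements commit or roll back together, a retry never produces a duplicate payment or a double debit.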
At a Glance:
Append-only: sub-10ms writes, roughly 100k events/sec, duplicates in the data, complex downstream dedupe.
Idempotent upsert: 20 to 50ms writes, roughly 30k operations/sec with the key index, clean data, simple downstream reads.
"Choose append-only for high-volume raw ingestion. Choose idempotent upserts for derived tables that others depend on. Mix both in the same pipeline."
💡 Key Takeaways
✓ Append-only writes achieve sub-10ms p99 latency at 100k+ events/sec but push deduplication complexity onto every consumer
✓ Idempotent upserts increase write latency to 20 to 50ms and cut throughput by roughly 70% (about 100k/sec down to 30k/sec) but eliminate downstream complexity
✓ Choose append-only for raw ingestion and high-volume streams where sophisticated consumers can handle dedupe
✓ Choose idempotent upserts for derived tables, aggregates, and any data consumed by BI tools or less technical users
✓ Storage cost matters: a 5% duplicate rate at petabyte scale wastes significant money, favoring idempotent sinks for long-term storage
📌 Examples
1. High-volume append: Kafka ingests 500k events/sec with 8ms p99 latency. A downstream Flink job deduplicates using stateful processing before writing to the warehouse.
2. Idempotent aggregate: A daily revenue table uses a MERGE statement keyed by merchant_id and date. The batch job can rerun safely to fix bugs or incorporate late data.
3. Mixed strategy: Raw clickstream data appends to S3 at 1M events/sec. An hourly job reads from S3, dedupes by event_id, computes aggregates, and upserts into Redshift for BI queries; see the sketch below.
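A minimal sketch of that hourly job's SQL, combining both patterns (dedupe the appended raw data, then upsert the aggregate). The raw_clicks and hourly_clicks tables are hypothetical, and MERGE and DATE_TRUNC syntax varies by warehouse.

    -- Step 1 (inner query): drop duplicate appends by event_id.
    -- Step 2 (outer MERGE): idempotently upsert hourly counts, safe to rerun.
    MERGE INTO hourly_clicks AS target
    USING (
        SELECT
            user_id,
            DATE_TRUNC('hour', clicked_at) AS click_hour,
            COUNT(*) AS click_count
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY event_id
                       ORDER BY clicked_at DESC
                   ) AS rn
            FROM raw_clicks
        ) deduped
        WHERE rn = 1
        GROUP BY user_id, DATE_TRUNC('hour', clicked_at)
    ) AS source
        ON  target.user_id    = source.user_id
        AND target.click_hour = source.click_hour
    WHEN MATCHED THEN
        UPDATE SET click_count = source.click_count
    WHEN NOT MATCHED THEN
        INSERT (user_id, click_hour, click_count)
        VALUES (source.user_id, source.click_hour, source.click_count);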