
Failure Modes: What Breaks at the Edges

When you optimize for freshness or relax consistency, several subtle failure modes emerge in production. Understanding these edge cases is critical for system design interviews and real implementations.

Failure Mode One: Replication Lag Spikes and Stale Reads

Read replicas typically lag behind the primary by 100ms to 1 second in normal conditions. But during traffic spikes, deployments, or network partitions, lag can spike to 10 seconds or more. This violates user expectations catastrophically.

Concrete example: A user updates their shipping address and clicks "Save". The write commits to the primary in 10ms. The page refreshes and reads from a replica that is currently 8 seconds behind due to a replication backlog. The user sees their old address and thinks the update failed. They change it again, creating a duplicate write.
[Diagram: Replication Lag Under Load. Normal: ~100 ms; spike: ~8 s; user sees: stale data]
Mitigation: Implement read-your-writes consistency by routing a user's reads to the primary for 5 to 10 seconds after they write, or use session tokens that encode the last committed transaction ID. Monitor replication lag as a first-class metric and alert when it exceeds 2 seconds at p99. During lag spikes, you might temporarily route all reads to the primary, accepting higher latency to maintain correctness. (A minimal sketch of session-pinned read routing appears at the end of this section.)

Failure Mode Two: Out-of-Order Events in Streaming Pipelines

In systems optimized for freshness, events can arrive out of order due to network retries, partition reassignments, or parallel processing. This breaks assumptions about event ordering.

Example from production: A user updates their email at timestamp T1, then deletes their account at timestamp T2. The delete event is processed first because it was in a different Kafka partition with lower lag. The derived store processes the delete, removing the user. Then the email update event arrives and recreates the user with the new email, effectively resurrecting a deleted account.

At Uber scale, with millions of trip events per second, out-of-order arrival is common. A trip status update "trip completed" might arrive before "trip started" if the driver's phone had intermittent connectivity. Without proper handling, your analytics could show impossible states, such as a 30-minute trip duration reported before the start event has even been processed.

Mitigation: Use event time rather than processing time, and implement watermarks that bound lateness. For example, accept events up to 10 minutes late, and only finalize aggregations after the watermark passes (see the watermark sketch at the end of this section). Use idempotent operations and last-write-wins with timestamps. Add compensating transactions: if a late event arrives that invalidates earlier aggregations, emit correction events.

Failure Mode Three: Cross-System Inconsistencies

With multiple derived views updated asynchronously, an entity can exist in one system but not another, or have different values across systems during propagation windows.

Real scenario: An order is written to the orders database (immediately visible to customer support), sent via email within 2 seconds (customer has confirmation), appears in the warehouse management system after 30 seconds (ready for fulfillment), and lands in the business intelligence warehouse after 5 minutes (visible to executives). If the customer calls support at the 1-minute mark, support sees the order but the warehouse does not, causing confusion.

This gets worse during backfills or reprocessing. When you rebuild a derived index, you might temporarily have double counts or missing data, causing metrics to spike or drop unexpectedly.

Mitigation: Maintain an audit log or event sourcing system as the authoritative source of truth. Each derived system stores metadata about the last event it processed, allowing you to detect and reconcile drift. Run continuous data quality checks that compare row counts and key metrics across systems (a small drift-check sketch follows this section). Use schema versioning and backward-compatible changes when evolving data formats.

Failure Mode Four: Clock Skew and Timestamp Ordering

If your freshness and ordering logic relies on wall-clock timestamps from distributed servers, clock skew can cause events to appear in the wrong order. Network Time Protocol (NTP) typically keeps clocks within 100ms, but during NTP failures or misconfigurations, skew can reach seconds. A write at 10:00:00.500 on server A might be assigned an earlier timestamp than a write at 10:00:00.400 on server B if server B's clock is 200ms ahead.

This breaks causal ordering and can cause newer data to be overwritten by older data in last-write-wins systems.

Mitigation: Use logical clocks (Lamport timestamps or vector clocks) for ordering within a partition or shard. Use hybrid logical clocks that combine physical time with logical counters, providing both human-readable timestamps and correct causal ordering (a sketch follows below). In critical systems like Spanner, use atomic clocks and GPS for precise time synchronization with bounded uncertainty.
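To make the read-your-writes mitigation concrete, here is a minimal sketch of session-pinned read routing. The `primary` and `replica` arguments stand in for whatever database clients you use; the only assumption is that both expose an `execute` method, and the 10-second pin window is illustrative.

```python
import time

PIN_SECONDS = 10  # route a user's reads to the primary for this long after a write


class ReadYourWritesRouter:
    """Routes reads to a replica unless the user wrote recently (hypothetical clients)."""

    def __init__(self, primary, replica):
        self.primary = primary        # client for the primary
        self.replica = replica        # client for a read replica
        self.last_write_at = {}       # user_id -> monotonic time of last write

    def write(self, user_id, query, *args):
        self.last_write_at[user_id] = time.monotonic()
        return self.primary.execute(query, *args)

    def read(self, user_id, query, *args):
        pinned_until = self.last_write_at.get(user_id, float("-inf")) + PIN_SECONDS
        target = self.primary if time.monotonic() < pinned_until else self.replica
        return target.execute(query, *args)
```

A stricter variant replaces the timer with the last committed transaction ID (or log position) stored in the user's session, and only reads from a replica once it has replayed past that position.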
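The watermark mitigation for out-of-order events can also be sketched briefly. This is an illustrative single-process version (a real pipeline would use a stream processor such as Flink or Kafka Streams); the 10-minute allowed lateness and 1-minute tumbling windows are assumptions chosen to match the example above.

```python
from collections import defaultdict

ALLOWED_LATENESS_S = 600   # accept events up to 10 minutes late
WINDOW_S = 60              # 1-minute tumbling windows keyed by event time


class WatermarkAggregator:
    """Counts events per event-time window; finalizes a window only after the watermark passes it."""

    def __init__(self):
        self.open_windows = defaultdict(int)   # window_start -> event count
        self.max_event_time = 0.0              # highest event time seen so far

    def on_event(self, event_time):
        watermark = self.max_event_time - ALLOWED_LATENESS_S
        if event_time < watermark:
            return "late"                      # behind the watermark: emit a correction downstream
        self.max_event_time = max(self.max_event_time, event_time)
        window = int(event_time // WINDOW_S) * WINDOW_S
        self.open_windows[window] += 1
        return "accepted"

    def finalize_ready_windows(self):
        watermark = self.max_event_time - ALLOWED_LATENESS_S
        ready = [w for w in self.open_windows if w + WINDOW_S <= watermark]
        return {w: self.open_windows.pop(w) for w in ready}
```

Events that arrive behind the watermark are flagged rather than silently merged, so a downstream job can emit correction events for windows that were already finalized.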
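Cross-system drift checks are usually simple comparisons run on a schedule. A hypothetical sketch, assuming each store exposes some way to count a key entity (orders here) and that one of the stores is designated the source of truth:

```python
def check_drift(count_fns, tolerance=0.001):
    """Compare entity counts across derived stores against the source of truth.

    `count_fns` maps a store name to a zero-argument callable returning its order count;
    the key "source_of_truth" is assumed to be present (hypothetical setup).
    """
    counts = {name: fetch() for name, fetch in count_fns.items()}
    baseline = counts["source_of_truth"]
    return {
        name: count
        for name, count in counts.items()
        if baseline and abs(count - baseline) / baseline > tolerance
    }
```

Anything returned by a check like this feeds an alert or a reconciliation job that replays missed events into the lagging store.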
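Finally, a minimal hybrid logical clock shows the clock-skew mitigation: timestamps are (physical milliseconds, logical counter) pairs that compare lexicographically, so causally later events always sort later even when wall clocks disagree by a few hundred milliseconds. This is an illustrative sketch under those assumptions, not a drop-in production HLC.

```python
import time


class HybridLogicalClock:
    """Hybrid logical clock: physical time plus a logical counter to break ties and absorb skew."""

    def __init__(self):
        self.l = 0   # highest physical time observed so far (ms)
        self.c = 0   # logical counter within that millisecond

    def now(self):
        """Timestamp for a local event or an outgoing message."""
        pt = int(time.time() * 1000)
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote_l, remote_c):
        """Merge a timestamp received from another node so causal order is preserved."""
        pt = int(time.time() * 1000)
        if pt > self.l and pt > remote_l:
            self.l, self.c = pt, 0
        elif remote_l > self.l:
            self.l, self.c = remote_l, remote_c + 1
        elif self.l > remote_l:
            self.c += 1
        else:  # self.l == remote_l
            self.c = max(self.c, remote_c) + 1
        return (self.l, self.c)
```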
⚠️ Interview Insight: When discussing freshness optimizations, always mention how you would detect and handle these failure modes. Showing you understand replication lag monitoring, watermarks for late events, and clock skew mitigations demonstrates production maturity.
💡 Key Takeaways
Replication lag, normally 100ms to 1 second, can spike to 10+ seconds under load, causing users to not see their own writes and triggering duplicate submissions
Out-of-order event processing in streaming systems can resurrect deleted entities or create impossible states like completed trips that never started
Multiple derived stores updated asynchronously create windows where the same entity has different values across systems, confusing users and support teams
Clock skew between servers can cause newer writes to be assigned earlier timestamps, breaking last-write-wins logic and causal ordering assumptions
Production systems require monitoring for lag spikes, watermarks for late events, reconciliation jobs for cross-system drift, and logical clocks for correct ordering
📌 Examples
1. A messaging app sees replication lag spike from 200ms to 12 seconds during a deployment, causing 5% of users to report "message not delivered" even though the writes succeeded on the primary
2. During a Black Friday traffic spike, an e-commerce platform's CDC pipeline falls 3 minutes behind, causing inventory counts in the search index to be dangerously stale and leading to 500 oversold items
3. Netflix detects clock skew of 2 seconds on a subset of playback servers, causing some viewing history updates to be lost when older timestamps overwrite newer ones in their eventually consistent store