Failure Modes and Edge Cases in Event Data Modeling
The Reality of Production Systems:
Event data models operate in messy, distributed environments. Network retries generate duplicates. Clock skew produces out of order events. Hot users create skewed partitions. Understanding these failure modes is critical for building robust analytics systems.
Duplicates and Idempotency:
Network retries and client buffering easily generate duplicate events. A mobile app sends a purchase_completed event, the network times out, the app retries, and you receive the same event twice. Without deduplication, your revenue metrics are inflated. The solution is to make event processing idempotent by assigning every event a globally unique ID: if two events share the same ID and source within a bounded deduplication window based on event time, typically 24 to 72 hours, downstream processors treat the later arrival as a duplicate and drop it.
However, this requires storage overhead. You must maintain a lookup table of recently seen event IDs. At billions of events per day, this lookup can consume hundreds of gigabytes of memory or require fast key value stores like Redis. The deduplication window is a tradeoff: longer windows catch more duplicates but increase storage cost and lookup latency.
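As a rough sketch, a minimal idempotent consumer might look like the following, assuming events carry event_id and source fields; the in-memory dict and a first-seen arrival-time window stand in for the fast key value store and event-time bookkeeping a production system would use:

```python
import time

DEDUP_WINDOW_SECONDS = 72 * 3600  # 72 hour deduplication window

_seen = {}  # (event_id, source) -> time we first saw the event

def is_duplicate(event, now=None):
    """Return True if this event already arrived inside the window."""
    now = now if now is not None else time.time()
    key = (event["event_id"], event["source"])

    # Evict entries older than the window (a real store would use TTLs).
    for k, first_seen in list(_seen.items()):
        if now - first_seen > DEDUP_WINDOW_SECONDS:
            del _seen[k]

    if key in _seen:
        return True            # later arrival with the same ID: drop it
    _seen[key] = now           # first sighting: remember it
    return False

# The retried purchase_completed event is dropped on its second arrival.
evt = {"event_id": "evt-123", "source": "mobile-app", "name": "purchase_completed"}
print(is_duplicate(evt))  # False
print(is_duplicate(evt))  # True
```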
Out of Order and Late Arriving Events:
Clock skew between devices produces events that appear to occur in the wrong order when sorted by event time. A user initiates checkout on a mobile device with a clock running 10 minutes fast, then completes the purchase through a payment flow stamped by an accurate server clock. Sorted by timestamp, the purchase_completed event appears to precede the checkout_initiated event. Session boundaries, time to conversion metrics, and funnel analysis break if you assume perfect ordering.
The standard solution is to use watermarks. A watermark declares that all events with timestamps up to time T have been seen. Events arriving after the watermark with timestamps before T are considered late. You can drop late events, buffer them for a grace period, or trigger recomputation of affected aggregates. Companies typically set watermarks at 5 to 15 minutes behind real time to balance completeness and latency.
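A minimal sketch of watermark-based lateness classification, assuming events carry an event_time field and using a fixed 10 minute grace period from the range above:

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=10)  # within the 5 to 15 minute range above

def classify(event, processing_time):
    """Label an event as on-time or late relative to the current watermark."""
    watermark = processing_time - GRACE_PERIOD
    if event["event_time"] < watermark:
        # Past the watermark: drop it, buffer it, or trigger recomputation
        # of the aggregates covering event["event_time"].
        return "late"
    return "on_time"

now = datetime(2024, 1, 1, 10, 20)
print(classify({"event_time": datetime(2024, 1, 1, 9, 55)}, now))   # late
print(classify({"event_time": datetime(2024, 1, 1, 10, 12)}, now))  # on_time
```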
⚠️ Common Pitfall: Regulatory constraints require deletion of user data, but events are immutable by design. If your model doesn't include consistent user keys or deletion markers, you cannot reliably remove a user's footprint without destroying shared metrics.
Hot Partitions and Skewed Load:
If you partition only by time, extremely active users or tenants create hot partitions. A large enterprise customer sending 1000 events per second lands all of its traffic on one partition while others sit idle, which throttles throughput and increases latency. The solution is to partition by a hash of user ID, session ID, or tenant ID in addition to time, spreading load across many partitions. However, it complicates range queries: a time-range scan across all users must fan out to every hash partition, and a query keyed by anything other than the hash key (for example, one user's events when partitions are hashed by session ID) still has to check every partition for the time range.
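A minimal sketch of such a composite partition key, with an illustrative bucket count and field names, hashing the user ID and bucketing time by hour:

```python
import hashlib
from datetime import datetime

N_HASH_BUCKETS = 32  # illustrative; size to your throughput and query targets

def partition_key(user_id, event_time):
    """Hourly time bucket plus a stable hash bucket of the user ID."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % N_HASH_BUCKETS
    return f"{event_time:%Y-%m-%d-%H}-{bucket:02d}"

# Two users from the same hot tenant land on different partitions for the
# same hour, while a time-range scan across all users must touch all 32.
print(partition_key("tenant-42-user-a", datetime(2024, 1, 1, 10)))
print(partition_key("tenant-42-user-b", datetime(2024, 1, 1, 10)))
```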
Schema Drift Over Time:
Different teams emit similar events with slight variations. One team logs "signup" with field "user_name". Another logs "user_signup" with "username". A third logs "user_signed_up" with "name". Your funnel queries must union three tables with different schemas, breaking comparability. Unwinding drift requires expensive backfills to rewrite historical events with a unified schema. Mature companies prevent this by enforcing schema reviews before production deployment and treating event schemas as contracts with explicit versioning.
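A minimal sketch of what such a contract check might look like; the registry and names (EVENT_CONTRACTS, validate_event) are illustrative rather than any real schema-registry API:

```python
# One agreed-upon schema per (event name, version); producers validate
# against it before emitting, so "signup"/"user_signup" variants never ship.
EVENT_CONTRACTS = {
    ("user_signed_up", 1): {"user_id": str, "username": str, "signup_source": str},
}

def validate_event(name, version, payload):
    contract = EVENT_CONTRACTS.get((name, version))
    if contract is None:
        raise ValueError(f"unregistered event schema: {name} v{version}")
    for field, field_type in contract.items():
        if not isinstance(payload.get(field), field_type):
            raise ValueError(f"{name} v{version}: missing or mistyped field {field!r}")

# Passes: matches the registered contract.
validate_event("user_signed_up", 1,
               {"user_id": "u1", "username": "ada", "signup_source": "web"})
# Raises: a drifted variant using "user_name" never reaches production.
# validate_event("user_signed_up", 1, {"user_id": "u1", "user_name": "ada"})
```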
High Cardinality Explosions:
Naively emitting a separate event property for every possible attribute causes schema explosion. An ecommerce site with 10,000 product attributes creates events with 10,000 columns. Storage and query engines choke. The solution is to separate high cardinality attributes into dimension tables and join at query time, or use nested structures for variable attributes. For example, store product_id in the event and join to a product_attributes table, rather than embedding all attributes in every product_viewed event.
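A minimal sketch of the lean-event shape, with an illustrative product_attributes dimension keyed by product_id and the join performed at query time:

```python
# Dimension "table": the 10,000 possible attributes live here, once per product.
product_attributes = {
    "sku-42": {"brand": "Acme", "color": "red", "weight_g": 310},
}

# Lean event: only the foreign key, never the full attribute set.
event = {
    "event_name": "product_viewed",
    "event_time": "2024-01-01T10:02:00Z",
    "user_id": "u1",
    "product_id": "sku-42",
}

# Join at query time instead of embedding every attribute in every event.
enriched = {**event, **product_attributes[event["product_id"]]}
print(enriched["brand"])  # "Acme"
```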
Privacy and Deletion Challenges:
Events are immutable, but privacy laws require deletion. If a user requests deletion under GDPR, you must remove their data. If your events don't include consistent user keys across all identifiers, you can't find all their data. If you delete events outright, you break aggregate metrics that depend on event counts. The solution is to use tombstone markers or pseudonymization. Instead of deleting events, you overwrite personally identifiable information (PII) fields with null or hashed values, preserving event counts and timestamps for aggregate metrics while removing identifiable data.
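A minimal sketch of pseudonymization on a deletion request, with illustrative PII field names; hashing with a per-user salt also allows later crypto-shredding by discarding the salt:

```python
import hashlib

PII_FIELDS = {"email", "ip_address", "full_name"}  # illustrative field names

def pseudonymize(event, user_salt):
    """Overwrite PII fields with salted hashes, keeping counts and timestamps."""
    scrubbed = dict(event)
    for field in PII_FIELDS & scrubbed.keys():
        value = scrubbed[field]
        scrubbed[field] = (
            hashlib.sha256(user_salt + str(value).encode()).hexdigest()
            if value is not None
            else None
        )
    scrubbed["pii_removed"] = True  # tombstone marker for auditability
    return scrubbed

original = {"event_name": "purchase_completed", "event_time": "2024-01-01T10:02:00Z",
            "user_id": "u1", "email": "ada@example.com", "amount": 49.99}
print(pseudonymize(original, user_salt=b"per-user-secret"))
```

💡 Key Takeaways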
•Network retries generate duplicate events. Deduplication using unique event IDs within a 24 to 72 hour window requires maintaining a lookup table that can consume hundreds of gigabytes at billions of events per day.
•Clock skew causes out of order events. Watermarks set 5 to 15 minutes behind real time balance completeness and latency. Late events require dropping, buffering, or triggering aggregate recomputation.
•Hot partitions occur when partitioning only by time. An enterprise customer sending 1000 events per second can land all traffic on one partition. Hash partitioning by user ID or tenant ID spreads load but complicates range queries.
•Schema drift where teams emit similar events with different names and fields breaks funnel analysis. Unwinding requires expensive backfills to rewrite historical events with unified schemas.
•High cardinality attributes like embedding all product properties in events cause schema explosion. Storing 10,000 attributes per event chokes storage and query engines. Use dimension tables and joins instead.
•Privacy laws require deletion of immutable events. Tombstone markers or pseudonymization (overwriting PII fields with null or hashed values) preserve aggregate metrics while removing identifiable data.
📌 Examples
Duplicate scenario: Mobile app sends purchase_completed event, network times out after 3 seconds, app retries, server receives event twice with same event_id. Deduplication layer checks recent event ID cache (72 hour window), drops second occurrence, preventing revenue inflation.
Late event: User completes purchase on a device with a clock running 10 minutes fast, so the event carries timestamp 10:12 even though it actually arrives at 10:06. Later, an event with timestamp 09:55 arrives at 10:20. With a 10 minute grace period, the watermark has advanced to 10:10 (arrival time minus the grace period), so the 09:55 event is past the watermark and flagged as late. The system either drops it or triggers recomputation of the affected hourly aggregates.