Feature Engineering & Feature Stores › Point-in-Time Correctness (Hard, ⏱️ ~3 min)

PIT Correctness Failure Modes and Edge Cases

Out of Order and Late Arriving Data

Point-in-time (PIT) correctness fails in subtle ways that can go undetected for months while silently degrading model quality. Streaming systems deliver events late: the p95 may arrive on time while the p99 lags by minutes to hours. If the system gates by processing time instead of event time, late arrivals overwrite history and contaminate training labels. A user action that happened at 2pm but arrived at 4pm appears available at 2pm in a naive system, leaking future information into training.
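A minimal sketch of the gating idea, for a training-time join (the names `FeatureRow` and `visible_features` are illustrative, not from any particular feature store):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FeatureRow:
    entity_id: str
    event_time: datetime   # when the action happened (producer clock)
    ingest_time: datetime  # when the feature store actually received it
    value: object

def visible_features(rows, entity_id, label_time):
    """Features a PIT-correct training join may use for a label at label_time.

    Gating on ingest_time as well as event_time means a 2pm action that
    arrived at 4pm is NOT visible to a 3pm label, matching what the
    online serving path could actually have seen at that moment.
    """
    return [
        r for r in rows
        if r.entity_id == entity_id
        and r.event_time <= label_time    # the action must have happened...
        and r.ingest_time <= label_time   # ...and have landed in the store
    ]

rows = [FeatureRow("u1", datetime(2024, 1, 1, 14, 0),
                   datetime(2024, 1, 1, 16, 0), "clicked_ad")]

late = visible_features(rows, "u1", datetime(2024, 1, 1, 15, 0))  # before arrival
ok = visible_features(rows, "u1", datetime(2024, 1, 1, 16, 30))   # after arrival
```

The 2pm event is excluded from a 3pm label (it had not arrived yet) but included for a 4:30pm label, which is exactly the asymmetry a processing-time gate gets wrong.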

Backfill Contamination

Backfilling feature pipelines after bug fixes or schema changes can corrupt historical state. If the backfill uses current logic to recompute past values, features reflect information that was unavailable at the original timestamps. The fix is immutable, append-only storage where backfills create new versions rather than overwriting, preserving the original (buggy) values alongside the corrected ones for comparison.
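One way to sketch that append-only versioning (class and method names here are hypothetical, not a real feature-store API):

```python
from collections import defaultdict

class VersionedFeatureStore:
    """Append-only sketch: a backfill appends a new version instead of
    overwriting, so the original (buggy) value stays readable for comparison."""

    def __init__(self):
        # (entity_id, event_ts) -> list of (write_time, value); never mutated in place
        self._versions = defaultdict(list)

    def write(self, entity_id, event_ts, write_time, value):
        self._versions[(entity_id, event_ts)].append((write_time, value))

    def read(self, entity_id, event_ts, as_of=None):
        """Latest version written at or before as_of (None = current view)."""
        versions = sorted(self._versions.get((entity_id, event_ts), []))
        if as_of is not None:
            versions = [v for v in versions if v[0] <= as_of]
        return versions[-1][1] if versions else None

store = VersionedFeatureStore()
store.write("u1", event_ts=100, write_time=100, value=7.0)  # original pipeline run
store.write("u1", event_ts=100, write_time=500, value=3.0)  # backfill after bug fix
```

Reading `as_of` the original write time returns the buggy value a model trained back then actually saw; reading the current view returns the corrected one.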

Clock Skew

Distributed systems with unsynchronized clocks introduce systematic bias. If producer clocks run ahead, events appear available before they actually occurred. NTP synchronization to millisecond precision across all infrastructure is table stakes. Log both event time (from producer) and ingestion time (from consumer) to detect and correct skew during joins.
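Logging both timestamps enables a simple skew estimate; this is a sketch under the assumption that timestamps are epoch seconds:

```python
def estimate_clock_skew(event_ingest_pairs):
    """Median of (ingest_time - event_time) in seconds.

    Network and queueing delay only ever push ingest_time later, so a
    consistently *negative* median implies the producer clock runs ahead;
    the estimate can then be subtracted from event times during joins.
    """
    deltas = sorted(ingest - event for event, ingest in event_ingest_pairs)
    return deltas[len(deltas) // 2]

# Producer clock ~30s ahead: events appear to arrive "before" they occurred.
pairs = [(1000.0, 972.0), (1100.0, 1071.0), (1200.0, 1169.0)]
skew = estimate_clock_skew(pairs)
```

A median is used rather than a mean so occasional multi-hour stragglers do not swamp the estimate.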

Window Boundary Races

Aggregates over time windows (e.g., a sum over the last 7 days) face boundary conditions. An event at exactly midnight on a day boundary might be included or excluded depending on whether the boundary comparison is `<` or `<=`. Inconsistent boundary handling between training and serving causes subtle feature drift.
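A common convention that avoids the race is a half-open window, shown here as a minimal sketch (the `in_window` helper is illustrative):

```python
from datetime import datetime, timedelta

def in_window(event_ts, window_end, days=7):
    """Half-open window [window_end - 7d, window_end): every event falls
    in exactly one window, so training and serving agree at the boundary
    as long as both use this same predicate."""
    return window_end - timedelta(days=days) <= event_ts < window_end

window_end = datetime(2024, 1, 8, 0, 0)
boundary_event = datetime(2024, 1, 8, 0, 0)  # exactly at the window end
start_event = datetime(2024, 1, 1, 0, 0)     # exactly at the window start
```

With `<=` on the start and strict `<` on the end, the midnight event belongs to the next window, never to both or neither.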

Timezone Bugs

Features aggregated by calendar day must handle timezone consistently. A global user active across UTC day boundaries may have activity counted twice or missed entirely if training and serving use different timezone assumptions.
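One defensible convention is to bucket by UTC calendar day everywhere; a minimal sketch (the `utc_day_key` helper is illustrative):

```python
from datetime import datetime, timedelta, timezone

def utc_day_key(ts):
    """Bucket a timezone-aware timestamp by UTC calendar day, so training
    and serving count a globally active user's events identically."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")

# 11pm in UTC-5 is already the next calendar day in UTC.
est = timezone(timedelta(hours=-5))
key = utc_day_key(datetime(2024, 1, 1, 23, 0, tzinfo=est))
```

If serving bucketed this event under the local Jan 1 while training bucketed it under UTC Jan 2, the same activity would be counted in different daily features.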

💡 Key Takeaways
- Out-of-order data with p99 lateness of minutes to hours requires watermarks (for example, a 24-hour late-data acceptance window) and retraction policies to prevent late arrivals from incorrectly overwriting history
- Clock skew of seconds to minutes across distributed services creates 1 to 5 percent feature mismatches at window boundaries, requiring UTC enforcement and per-entity monotonicity validation
- Entity key churn from user or device merges contaminates feature vectors unless handled with an explicit entity-resolution history maintaining effective start and end times
- Aggregation boundary bugs from inclusive versus exclusive semantics (t versus t minus 1 second) systematically inflate offline metrics by 0.5 to 2 percent while degrading online serving
- Partial replay failures in streaming consumers duplicate updates without idempotency keys and event-time conflict resolution, inflating aggregate features by 5 to 20 percent
- Hard deletes for GDPR or data retention erase history needed for time-travel training, requiring tombstones with effective times to maintain reproducibility while respecting deletion
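The idempotency-key point above can be sketched in a few lines (the `IdempotentAggregator` class is hypothetical, standing in for a streaming consumer's state):

```python
class IdempotentAggregator:
    """Sketch of replay-safe aggregation: each update carries an
    idempotency key, so a partially replayed stream cannot double-count."""

    def __init__(self):
        self._seen = set()   # idempotency keys already applied
        self.total = 0.0

    def apply(self, event_id, amount):
        if event_id in self._seen:
            return  # replayed duplicate: already applied, ignore
        self._seen.add(event_id)
        self.total += amount

agg = IdempotentAggregator()
agg.apply("evt-1", 10.0)
agg.apply("evt-2", 5.0)
agg.apply("evt-1", 10.0)  # consumer replay re-delivers evt-1
```

Without the `_seen` check, the replayed `evt-1` would be counted twice, which is exactly how partial replays inflate aggregate features.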
📌 Interview Tips
1. Ad click prediction system: p95 of clicks arrive within 1 minute, p99 within 1 hour. Without watermarks, late clicks leak into past training examples, causing a 10 percent precision drop that surfaced only when the leak was detected and fixed
2. Multi-region service with 30-second clock skew: a 7-day rolling window computed at the boundary includes 30 seconds more data in one region, causing a 2 percent feature-value divergence detected in an A/B test
3. User account merge at Airbnb: a user books from the mobile app (entity A) and the web (entity B), then merges accounts. The feature join must use the entity-resolution effective date to avoid cross-contamination in historical training data