Time Travel Storage Patterns for Feature Versioning
Time Travel Storage Concept
Time travel storage enables reconstructing feature state at any historical timestamp by maintaining immutable, versioned histories. The core pattern combines base snapshots with append-only change logs: each feature update creates a new version keyed by entity ID and event timestamp, optionally with a version counter for deduplication. This mirrors database point-in-time recovery, adapted for ML feature semantics with per-entity timelines.
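A minimal sketch of this pattern in Python. The class and method names (VersionedFeatureStore, as_of) are illustrative, not taken from any particular system; the point is the append-only version log per entity and the point-in-time lookup.

```python
import bisect
from collections import defaultdict

class VersionedFeatureStore:
    """Toy append-only store: each write creates an immutable version
    keyed by (entity_id, event_timestamp, version_counter)."""

    def __init__(self):
        # entity_id -> sorted list of (event_ts, version_counter, features)
        self._log = defaultdict(list)

    def write(self, entity_id, event_ts, features):
        versions = self._log[entity_id]
        # the version counter deduplicates updates sharing one timestamp
        counter = sum(1 for ts, _, _ in versions if ts == event_ts)
        bisect.insort(versions, (event_ts, counter, features))

    def as_of(self, entity_id, ts):
        """Point-in-time read: latest feature values at or before ts."""
        versions = self._log[entity_id]
        i = bisect.bisect_right(versions, (ts, float("inf"), None))
        return versions[i - 1][2] if i else None

store = VersionedFeatureStore()
store.write("user_1", 100, {"clicks": 3})
store.write("user_1", 200, {"clicks": 7})
store.as_of("user_1", 150)  # -> {"clicks": 3}
```

Because versions are immutable and sorted by timestamp, any historical read is a binary search rather than a scan, which is the same idea the table formats below implement at file granularity.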
Copy on Write Architecture
Delta Lake and Apache Iceberg use copy-on-write semantics: updates write new file versions rather than modifying files in place. A transaction log records which files are valid at each version. To read features as of timestamp T, the system resolves the log to the files committed at or before T, then applies per-row timestamp filters within those files. This enables time travel queries such as SELECT * FROM features TIMESTAMP AS OF '2024-01-15' (or VERSION AS OF n for a specific commit number).
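The log-resolution step can be sketched as a replay of commits up to T. This is a simplified model, not the actual Delta Lake or Iceberg metadata format: each commit adds new data files and removes the files it rewrote.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    timestamp: int            # commit time
    added: set = field(default_factory=set)    # files made valid
    removed: set = field(default_factory=set)  # files invalidated (rewritten)

def files_as_of(log, t):
    """Replay the transaction log up to timestamp t to find the
    set of data files forming the table snapshot at t."""
    live = set()
    for commit in sorted(log, key=lambda c: c.timestamp):
        if commit.timestamp > t:
            break
        live |= commit.added
        live -= commit.removed
    return live

log = [
    Commit(100, added={"f1.parquet"}),
    Commit(200, added={"f2.parquet"}),
    # copy-on-write update: f1 is rewritten as f1_v2
    Commit(300, added={"f1_v2.parquet"}, removed={"f1.parquet"}),
]
files_as_of(log, 250)  # -> {"f1.parquet", "f2.parquet"}
```

Note that "removed" files are only dropped from the snapshot view, not deleted from storage; they must survive on disk for time travel to work, which is what the retention discussion below is about.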
Merge on Read Architecture
Apache Hudi optimizes for write-heavy workloads using merge-on-read. Base files store snapshots, delta logs capture changes, and compaction periodically merges deltas into new base files. Reads must merge base files with their deltas, adding query overhead but reducing write amplification. For high-churn features (updated on every request), merge-on-read cuts storage costs 2 to 5x versus copy-on-write.
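A toy sketch of the merge-on-read read path, with a plain dict standing in for the base file and a list of timestamped updates standing in for the delta log (this is not Hudi's actual API):

```python
def read_merge_on_read(base, deltas, entity_id, ts):
    """Merge-on-read: start from the base snapshot, then apply
    delta-log entries for this entity up to timestamp ts."""
    value = dict(base.get(entity_id, {}))
    for d_ts, d_entity, updates in sorted(deltas):  # oldest first
        if d_entity == entity_id and d_ts <= ts:
            value.update(updates)
    return value

base = {"user_1": {"clicks": 3, "country": "US"}}
deltas = [
    (150, "user_1", {"clicks": 5}),
    (250, "user_1", {"clicks": 9}),
]
read_merge_on_read(base, deltas, "user_1", 200)
# -> {"clicks": 5, "country": "US"}
```

The write path only appends a small delta record, while copy-on-write would rewrite the whole base file; the cost is that every read repeats this merge until compaction folds the deltas in.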
Retention and Compaction
Historical versions consume 1.5 to 3x the storage of the current state. Production systems balance retention windows (7 to 90 days is typical) against storage cost. Aggressive compaction reduces file count but limits the time travel range. The sweet spot depends on training cadence: if you retrain weekly, 14-day retention suffices; quarterly audits may require 90-day retention.
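The retention trade-off amounts to a vacuum step: versions committed before the cutoff are dropped, after which time travel is only possible within the window. A hedged sketch (the function name vacuum and the version-dict shape are assumptions for illustration):

```python
from datetime import datetime, timedelta

def vacuum(versions, now, retention_days):
    """Drop versions older than the retention window. Afterwards,
    time travel only covers [now - retention_days, now]."""
    cutoff = now - timedelta(days=retention_days)
    kept = [v for v in versions if v["committed_at"] >= cutoff]
    # never drop the newest version: current-state reads still need it
    if not kept and versions:
        kept = [max(versions, key=lambda v: v["committed_at"])]
    return kept

now = datetime(2024, 6, 1)
versions = [
    {"id": 1, "committed_at": datetime(2024, 4, 1)},
    {"id": 2, "committed_at": datetime(2024, 5, 25)},
    {"id": 3, "committed_at": datetime(2024, 6, 1)},
]
vacuum(versions, now, retention_days=14)  # keeps versions 2 and 3
```

With a weekly retraining cadence, retention_days=14 leaves a full extra cycle of slack; an audit requirement simply raises the cutoff parameter, at the 1.5 to 3x storage multiplier noted above.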
Netflix Scale
Netflix uses snapshot-based time travel on petabyte-scale tables to rebuild exact historical training datasets months later for audits and model rollbacks.