Latency vs Cost Trade-offs in Feature Storage
Cost Differential
Online feature stores deliver millisecond latency through highly available, in-memory or SSD-optimized databases such as Redis, DynamoDB, or Cassandra, but that performance comes at 10 to 50x the cost per gigabyte-month of offline object storage like S3 or a data lake. A production recommendation system might pay $50 per gigabyte-month for Redis versus $1 per gigabyte-month for S3, which makes the choice of which features live online a critical cost-optimization decision.
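The cost gap is easy to see with back-of-the-envelope arithmetic. A minimal sketch, using the illustrative $50 and $1 per gigabyte-month figures above (the function name and feature sizes are hypothetical, not vendor quotes):

```python
ONLINE_COST_PER_GB_MONTH = 50.0   # e.g. managed Redis (figure from the text)
OFFLINE_COST_PER_GB_MONTH = 1.0   # e.g. S3 (figure from the text)

def monthly_storage_cost(gb_online: float, gb_offline: float) -> float:
    """Total monthly storage spend for a hybrid feature store."""
    return (gb_online * ONLINE_COST_PER_GB_MONTH
            + gb_offline * OFFLINE_COST_PER_GB_MONTH)

# Keeping 1 TB of features fully online vs. a 90/10 offline/online split:
all_online = monthly_storage_cost(1000, 0)    # $50,000 / month
hybrid = monthly_storage_cost(100, 900)       # $5,900 / month
```

At these prices, moving 90% of the footprint offline cuts the monthly bill by roughly 8x, which is why the placement decision dominates feature-store economics.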
Operational Complexity
Online requirements drive operational complexity. Multi-region replication for 99.99% availability, automated failover, consistent hashing for sharding, and aggressive TTL policies to prevent unbounded growth all add engineering overhead. DoorDash reported that serving 10,000+ QPS per service with burst handling and sub-10ms p99 latency requires sophisticated autoscaling, partition-aware backpressure handling, and circuit breakers that fall back to default values during incidents.
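The fallback pattern can be sketched as a small circuit breaker. This is an illustrative sketch, not DoorDash's actual implementation; the class name, thresholds, and default values are all assumptions:

```python
import time

class FeatureCircuitBreaker:
    """Serve default feature values instead of hitting the online store
    after `max_failures` consecutive errors, for `cooldown_s` seconds."""

    def __init__(self, fetch, defaults, max_failures=3, cooldown_s=30.0):
        self.fetch = fetch            # callable: entity_id -> feature dict
        self.defaults = defaults      # safe fallback feature values
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.open_until = 0.0

    def get(self, entity_id):
        if time.monotonic() < self.open_until:
            return self.defaults      # breaker open: skip the store entirely
        try:
            result = self.fetch(entity_id)
            self.failures = 0         # any success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open_until = time.monotonic() + self.cooldown_s
            return self.defaults      # degrade gracefully on any error
```

The key design choice is that a tripped breaker returns defaults immediately rather than queuing requests, so an online-store incident costs model quality (stale or neutral features) instead of request-path latency.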
Decision Framework
The decision centers on latency sensitivity versus feature cardinality. User-facing ranking, fraud detection, and dynamic pricing operate within 5 to 50ms incremental latency budgets, where online features materially affect CTR or conversion rates. In contrast, churn prediction, LTV modeling, and nightly batch recommendations can use offline-only features, since those decisions occur outside the request path. Most production systems adopt a hybrid: 10 to 100 latency-critical features served online, plus 100 to 1,000 rich features precomputed offline and cached.
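The framework reduces to a simple decision rule. A toy sketch, where the function name and the exact thresholds are illustrative assumptions rather than published best practice (only the 50ms request-path ceiling comes from the budgets above):

```python
def choose_feature_placement(latency_budget_ms: float,
                             accesses_per_hour: float) -> str:
    """Decide where a feature should live based on its latency budget
    and per-entity access rate. Thresholds are illustrative."""
    ONLINE_LATENCY_CEILING_MS = 50   # upper end of request-path budgets
    MIN_HOT_ACCESS_RATE = 1.0        # accesses/hour worth keeping online

    if latency_budget_ms <= ONLINE_LATENCY_CEILING_MS:
        if accesses_per_hour >= MIN_HOT_ACCESS_RATE:
            return "online"          # hot and latency-critical
        return "online-with-ttl"     # latency-critical but cold: evict fast
    return "offline"                 # outside the request path

# Fraud scoring in the request path vs. nightly churn prediction:
choose_feature_placement(20, 50)      # -> "online"
choose_feature_placement(5000, 0.1)   # -> "offline"
```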
Cost Aware Design
Cost-aware design constrains the online footprint through aggressive strategies. Netflix quantizes feature vectors to reduce memory, downsamples long-tail entities with low request rates, and evicts stale entries via TTLs measured in hours to days. For features accessed less than once per hour per entity, the cache-miss penalty of fetching from offline storage often beats the cost of maintaining online replicas across all regions.
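The replicate-versus-miss trade-off is a break-even calculation. A minimal sketch under assumed inputs; the function name, prices, and miss-penalty figure are hypothetical:

```python
def replica_vs_miss(gb: float, regions: int, online_cost_gb_month: float,
                    misses_per_month: float, cost_per_miss: float) -> str:
    """Compare the monthly cost of replicating a feature set online in
    every region against paying a fetch penalty on each offline miss."""
    replicate_cost = gb * regions * online_cost_gb_month
    fallback_cost = misses_per_month * cost_per_miss
    return "replicate" if replicate_cost < fallback_cost else "serve-from-offline"

# A cold long-tail feature set: 10 GB in 3 regions at $50/GB-month
# vs. 100k offline fetches at an assumed $0.0001 each.
replica_vs_miss(10, 3, 50.0, 100_000, 0.0001)   # -> "serve-from-offline"
```

For low-traffic entities the replica cost ($1,500/month in this example) dwarfs the miss penalty ($10/month), which is exactly the "less than once per hour per entity" regime described above.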