
Feature Sharing & Discovery: The Dual-Plane Architecture

Definition
Feature sharing and discovery enables hundreds of ML models across different teams to access thousands of features consistently without rebuilding them. The architecture uses a dual-plane design: an offline plane for training data, an online plane for low-latency serving at inference time, and a registry that acts as the central nervous system for metadata and discovery.

The Offline Plane

Computes and stores historical feature values in a data lake or warehouse (Hive, Delta Lake, BigQuery). Batch jobs materialize feature tables partitioned by entity and date, supporting point-in-time joins for training dataset generation. Throughput is the priority: scanning terabytes for a training job should complete in minutes to hours, not days.
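The point-in-time join can be illustrated with a minimal sketch: for each training row, look up the latest feature value observed at or before the label's event timestamp. The data and function names here are hypothetical, not part of any particular feature store's API.

```python
from bisect import bisect_right

# Hypothetical feature history: entity -> list of (timestamp, value),
# sorted by timestamp, as a batch job would materialize it.
feature_history = {
    "user_1": [(100, 0.2), (200, 0.5), (300, 0.9)],
}

def point_in_time_lookup(entity, event_ts):
    """Return the latest feature value observed at or before event_ts.

    Using only values from the past prevents leakage of future
    information into training rows.
    """
    history = feature_history.get(entity, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts) - 1
    return history[idx][1] if idx >= 0 else None

# A label event at t=250 sees the value written at t=200, not t=300.
print(point_in_time_lookup("user_1", 250))  # 0.5
```

Production stores implement the same semantics as a distributed join over partitioned tables, but the leakage-prevention rule is identical.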

The Online Plane

Serves features at inference time with strict latency requirements. Key-value stores (Redis, DynamoDB, Cassandra) provide sub-10ms p95 lookups. Streaming jobs continuously update online values from event streams. The online plane trades storage cost for latency: keeping features hot in memory costs 10 to 50x more per GB than offline cold storage.
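The online access pattern reduces to a single key lookup per entity, with a streaming job keeping the value fresh. A minimal in-memory sketch (a plain dict standing in for Redis or DynamoDB; names are illustrative):

```python
import time

# In-memory stand-in for an online key-value store.
online_store = {}

def stream_update(entity, features):
    """Streaming job writes the latest feature vector for an entity."""
    online_store[entity] = {**features, "_updated_at": time.time()}

def online_lookup(entity):
    """Inference-time read: one key lookup, no scans or joins."""
    return online_store.get(entity)

stream_update("user_1", {"clicks_7d": 14, "avg_session_s": 182.0})
print(online_lookup("user_1")["clicks_7d"])  # 14
```

The pre-materialized, single-key shape is what makes sub-10ms p95 achievable: all aggregation work is pushed into the streaming write path, leaving the read path trivially cheap.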

The Registry

Catalogs every feature with schema, owner, lineage, freshness SLA, and quality metrics. Serves as the single source of truth for discovery, enabling teams to search and evaluate candidate features before integration. Without a registry, teams reinvent features that already exist or use inconsistent definitions.
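A registry entry and a usage-ranked discovery query can be sketched as follows. The schema here is a deliberately minimal assumption; real registries also track lineage graphs, null rates, and drift scores.

```python
from dataclasses import dataclass

# Hypothetical minimal registry entry.
@dataclass
class FeatureEntry:
    name: str
    owner: str
    dtype: str
    freshness_sla_s: int   # how stale the online value may get
    usage_count: int = 0   # number of models consuming it

registry = {}

def register(entry):
    registry[entry.name] = entry

def discover(keyword):
    """Substring search, ranked by how many models already use the feature."""
    hits = [e for e in registry.values() if keyword in e.name]
    return sorted(hits, key=lambda e: e.usage_count, reverse=True)

register(FeatureEntry("user_clicks_7d", "growth-team", "int", 3600, usage_count=12))
register(FeatureEntry("user_clicks_30d", "ads-team", "int", 86400, usage_count=3))
print([e.name for e in discover("clicks")])  # most-used feature first
```

Ranking by usage is what turns the registry from passive storage into active governance: widely adopted, well-maintained features surface first, which is how high reuse rates emerge.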

Synchronization Challenge

The offline and online planes must stay synchronized. A feature definition change must propagate to both planes atomically, or training-serving skew emerges. Feature stores enforce this through versioned feature groups that materialize to both stores from the same transformation logic.
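The skew-prevention idea can be shown in a few lines: a single versioned transformation function is the only code path that writes feature values, and materialization writes its output to both planes. Store and function names are illustrative, not a real feature store API.

```python
# One versioned transformation, shared by batch and streaming paths.
def clicks_per_day_v2(raw):
    return raw["clicks"] / max(raw["days_active"], 1)

offline_table = {}  # stands in for the warehouse
online_kv = {}      # stands in for the key-value store

def materialize(entity, raw, version="v2"):
    """Apply the same versioned logic, then write to both planes.

    Because both stores receive the output of one function, the
    offline and online values cannot silently diverge.
    """
    value = clicks_per_day_v2(raw)
    key = (entity, "clicks_per_day", version)
    offline_table[key] = value
    online_kv[key] = value

materialize("user_1", {"clicks": 70, "days_active": 7})
# Both planes now hold 10.0 under the same versioned key.
```

The version in the key matters: rolling out "v3" creates new entries in both planes rather than mutating "v2" in place, so models pinned to the old version keep consistent training and serving values during migration.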

💡 Key Takeaways
- Dual-plane architecture separates offline training (TB scale, point-in-time correct) from online serving (5 to 20ms p95, 10K to 1M QPS), with the registry enforcing consistency across both planes
- The feature registry is not just storage but active governance: it ranks features by usage and quality, surfaces null rates and drift scores, and enforces training-serving parity to prevent skew
- Production systems achieve 30 to 70 percent reuse rates, cutting model onboarding from weeks to days at Netflix, Uber, LinkedIn, and Airbnb
- The online serving constraint drives the architecture: single-digit to low-tens-of-milliseconds p95 latency within sub-100ms end-to-end inference budgets requires pre-materialization and aggressive caching
- Point-in-time correctness is mandatory: offline joins use event timestamps to prevent data leakage, and the same transformation logic in batch and streaming paths prevents silent accuracy drops
- Scale envelope: thousands of features, hundreds of models, millions of events per minute for streaming updates, multi-month historical backfills at TB to PB scale
📌 Interview Tips
1. Netflix Zipline manages thousands of features used by hundreds of personalization models, processes daily TB-scale training sets with multi-month backfills, and maintains single-digit to low-tens-of-milliseconds p95 for online retrieval
2. Uber Michelangelo ingests millions of events per minute for ETA and pricing models, achieves 5 to 20ms p95 online lookups, and generates multi-TB training sets with point-in-time joins to prevent leakage
3. LinkedIn Feathr reduces time to production from weeks to days by ranking features by usage frequency and model performance attribution, and integrates with Venice for single-digit-millisecond online reads
4. Airbnb Bighead targets sub-100ms end-to-end inference for search ranking, allocating low-tens-of-milliseconds p95 to feature retrieval via pre-materialized stores and request coalescing