Production Scale: Metadata Stores and Dataset Tracking
The Metadata Architecture:
At scale, you need a centralized metadata store that tracks pipeline runs, dataset states, and dependency relationships. This becomes the source of truth for the control plane. For each pipeline run, you record pipeline ID, run ID, partition key (like date or region), status (RUNNING, SUCCESS, FAILED), timestamps, row counts, quality metrics, and upstream lineage.
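To make the shape of these records concrete, here is a minimal sketch of a run record as a Python dataclass; the class name, fields, and RunStatus values are illustrative assumptions, not the schema of any particular metadata store.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class RunStatus(Enum):
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


@dataclass
class PipelineRun:
    """One record per pipeline run in the metadata store (fields are illustrative)."""
    pipeline_id: str                 # e.g. "events_ingest"
    run_id: str                      # unique per execution
    partition_key: str               # e.g. "date=2025-12-24" or a region
    status: RunStatus
    started_at: datetime
    finished_at: datetime | None = None
    row_count: int | None = None
    quality_metrics: dict = field(default_factory=dict)        # e.g. {"null_rate": 0.0002}
    upstream_lineage: list[str] = field(default_factory=list)   # upstream pipeline/partition refs
```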
For each dataset or table, you track partitions and versions. When an upstream pipeline completes, it follows a heartbeat-then-finalize pattern: first mark the partition as status=RUNNING with a timestamp, then, after the data is fully written and validated, atomically update it to status=SUCCESS and publish an event. This prevents downstream jobs from starting on partially written data. The pattern plays out in three steps:
1. Write Start Signal: The pipeline begins writing the partition and records status=RUNNING with a start_time in the metadata store.
2. Data Quality Checks: After the write completes, validate row counts, null rates, and schema conformance against expected thresholds.
3. Atomic Finalize: Update the metadata to status=SUCCESS, row_count=3.2B, quality_passed=true, and publish an event to trigger downstream consumers.
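A minimal sketch of these three steps, assuming a SQL-backed metadata store (SQLite stands in here) and caller-supplied write and quality-check functions; the table, columns, and publish_event helper are illustrative, not a specific platform's API:

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema; a production metadata store would be a shared database service.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pipeline_runs (
    pipeline_id TEXT, run_id TEXT PRIMARY KEY, partition_key TEXT,
    status TEXT, started_at TEXT, finished_at TEXT,
    row_count INTEGER, quality_passed INTEGER
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def publish_event(topic: str, payload: dict) -> None:
    # Placeholder for the event bus (Kafka, SNS, ...); here it just prints.
    print(f"event -> {topic}: {payload}")


def heartbeat_then_finalize(db, pipeline_id, run_id, partition, write_partition, check_quality):
    """write_partition() -> row_count and check_quality(row_count) -> bool are
    caller-supplied stand-ins for the real write and validation logic."""
    # Step 1: record the start signal before any data is written.
    db.execute(
        "INSERT INTO pipeline_runs (pipeline_id, run_id, partition_key, status, started_at) "
        "VALUES (?, ?, ?, 'RUNNING', ?)",
        (pipeline_id, run_id, partition, _now()),
    )
    db.commit()

    row_count = write_partition()  # write the actual data files for the partition

    # Step 2: validate row counts / null rates / schema against expected thresholds.
    if not check_quality(row_count):
        db.execute("UPDATE pipeline_runs SET status = 'FAILED', finished_at = ? WHERE run_id = ?",
                   (_now(), run_id))
        db.commit()
        return

    # Step 3: atomic finalize, then publish the event that triggers downstream jobs.
    db.execute(
        "UPDATE pipeline_runs SET status = 'SUCCESS', row_count = ?, quality_passed = 1, "
        "finished_at = ? WHERE run_id = ?",
        (row_count, _now(), run_id),
    )
    db.commit()
    publish_event("partition.ready", {"pipeline_id": pipeline_id, "partition": partition})


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    heartbeat_then_finalize(conn, "events_ingest", "run-001", "date=2025-12-24",
                            write_partition=lambda: 1_000, check_quality=lambda n: n > 0)
```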
Typical Production Metrics: 2,000 DAGs, 50,000 tasks per day, 99.5% SLA success rate.
Real World Scale Numbers:
A large Airflow deployment at a company like Uber or Airbnb manages 500 to 2,000 DAGs with 10,000 to 50,000 tasks executing per day. Typical SLA expectations are p50 task start latency under 30 seconds after dependency satisfaction and p99 under 2 minutes, with end-to-end pipeline SLAs like "recommendation features ready by 2:00 AM with a 99.5% success rate".
Netflix has documented processing terabytes of data daily across hundreds of interdependent pipelines. Their metadata layer tracks millions of partition states, with updates happening at 1,000 to 10,000 events per second during peak hours. The metadata store must handle high write throughput while supporting low latency reads for dependency checks.
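That dependency-check read path can be a single indexed query; a minimal sketch against the same illustrative pipeline_runs table, with the function and column names assumed for the example:

```python
import sqlite3


def upstreams_ready(db: sqlite3.Connection, upstream_pipelines: list[str], partition: str) -> bool:
    """Return True only when every upstream pipeline has a SUCCESS run for this partition.

    This is the low latency read a scheduler issues before launching a downstream job;
    pipeline_runs and its columns are the same illustrative schema sketched above."""
    placeholders = ",".join("?" for _ in upstream_pipelines)
    row = db.execute(
        f"SELECT COUNT(DISTINCT pipeline_id) FROM pipeline_runs "
        f"WHERE pipeline_id IN ({placeholders}) AND partition_key = ? AND status = 'SUCCESS'",
        (*upstream_pipelines, partition),
    ).fetchone()
    return row[0] == len(upstream_pipelines)


# e.g. start the feature build only when both upstreams have finalized the partition:
# upstreams_ready(conn, ["events_ingest", "users_snapshot"], "date=2025-12-24")
```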
File Based Markers:
A common pattern for storage layer coordination is writing a small success marker file like _SUCCESS or .DONE to indicate that all data files for a partition are complete. Downstream jobs check for this marker instead of scanning directories or listing files, avoiding race conditions where data is still being written.
For example, an ingestion job writes 10,000 Parquet files to s3://data/events/date=2025-12-24/ over 20 minutes. Only after all files are written does it write s3://data/events/date=2025-12-24/_SUCCESS. Downstream jobs poll for, or receive an event about, the marker file rather than the individual data files, ensuring they never read incomplete partitions.
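A minimal sketch of the marker convention, using the local filesystem to stand in for object storage; in production the writes would go to S3 or GCS and the check would often be event-driven rather than a poll:

```python
from pathlib import Path


def finalize_partition(partition_dir: Path) -> None:
    # Write the marker only after every data file has been fully written.
    (partition_dir / "_SUCCESS").touch()


def partition_is_complete(partition_dir: Path) -> bool:
    # Downstream jobs check a single well-known marker instead of listing and
    # counting data files, so a half-written partition is never read.
    return (partition_dir / "_SUCCESS").exists()


if __name__ == "__main__":
    part = Path("data/events/date=2025-12-24")
    part.mkdir(parents=True, exist_ok=True)
    for i in range(3):  # stand-in for the example's 10,000 Parquet files
        (part / f"part-{i:05d}.parquet").write_bytes(b"")
    finalize_partition(part)
    assert partition_is_complete(part)
```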
💡 Key Takeaways
✓ Centralized metadata stores track millions of partition states with 1,000 to 10,000 updates per second during peak, requiring high write throughput and low read latency
✓ The heartbeat-then-finalize pattern prevents downstream jobs from reading partial data: write a start signal, validate quality, atomically mark SUCCESS, and publish an event
✓ Production Airflow deployments manage 500 to 2,000 DAGs executing 10,000 to 50,000 tasks daily, with p50 task start latency under 30 seconds and p99 under 2 minutes
✓ File based success markers like _SUCCESS eliminate race conditions by signaling partition completeness without requiring directory scans
✓ SLA tracking is critical: platforms target 99.5% success rates with specific deadlines like "features ready by 2:00 AM" for business operations
📌 Examples
1. Netflix's metadata layer processes terabytes daily across hundreds of pipelines, tracking partition states with quality metrics like row_count=3.2B and null_rate=0.02%
2. Spark jobs writing to S3 emit the _SUCCESS marker only after all tasks commit, preventing Hive queries from reading partial results during multi-hour writes