How Dependency Resolution Works: Polling vs Event-Driven
The Coordination Challenge:
Once you have explicit dependencies declared, the orchestrator needs a mechanism to detect when upstream conditions are satisfied. Two patterns dominate: polling (pull) and event-driven (push). Each trades off latency, resource consumption, and operational complexity.
Polling Pattern (Pull):
Downstream jobs use sensors that periodically query a metadata store to check if upstream conditions are met. For example, a sensor queries every 60 seconds: "Is dataset=user_events, partition=2025-12-24, status=SUCCESS?" When the condition is true, the downstream task starts.
The math matters here. With a 60 second polling interval, average latency to detect completion is 30 seconds (half the interval). If you have 10,000 concurrent sensors polling every minute, that is 167 queries per second against your metadata store. At FAANG scale with 500 to 2,000 DAGs running 10,000 to 50,000 tasks per day, this can overwhelm a single database.
Polling Performance Impact: 30s average latency, 167 queries/second
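As a concrete illustration of the pull loop, here is a minimal sketch of a polling sensor. The metadata store is faked with an in-memory dict; a real sensor would issue this lookup as a query against a database.

```python
import time

POLL_INTERVAL_S = 60  # average detection latency ~= interval / 2, i.e. ~30s

# Stand-in for the metadata store; a real sensor would query a database.
_METADATA = {("user_events", "2025-12-24"): "SUCCESS"}

def partition_status(dataset: str, partition: str) -> str | None:
    """Simulated metadata lookup: is this partition published, and with what status?"""
    return _METADATA.get((dataset, partition))

def wait_for_partition(dataset: str, partition: str, timeout_s: int = 4 * 3600) -> None:
    """Block until the upstream partition reports SUCCESS, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # One query per sensor per interval: 10,000 sensors at 60s
        # intervals is roughly 167 queries/second against the store.
        if partition_status(dataset, partition) == "SUCCESS":
            return  # condition met; the downstream task can start
        time.sleep(POLL_INTERVAL_S)
    raise TimeoutError(f"{dataset}/{partition} not SUCCESS within {timeout_s}s")

wait_for_partition("user_events", "2025-12-24")
```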
{"dataset": "user_events", "partition": "2025-12-24", "status": "SUCCESS", "row_count": 3200000000}. The orchestrator subscribes to these events and immediately enqueues dependent tasks.
Latency drops dramatically. Typical production systems achieve p50 task start latency under 5 seconds and p99 under 30 seconds after upstream completion. Event throughput can reach 1,000 to 10,000 events per second for large platforms like Netflix or Uber.
However, you now need robust event handling. Events can be duplicated due to retries, so downstream consumers must be idempotent. Network partitions or orchestrator restarts can cause missed events, requiring fallback polling or event replay. You also need exactly-once or at-least-once delivery semantics, which adds complexity.
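A common defense on the consume side is to derive an idempotency key from the event and skip anything already seen. The sketch below keeps processed keys in an in-memory set and uses a hypothetical enqueue_downstream hook; a production system would persist the dedup state durably.

```python
import json

_processed: set[str] = set()  # a real system would persist this, e.g. in a database

def enqueue_downstream(event: dict) -> None:
    """Hypothetical scheduler hook that queues the dependent task."""
    print(f"triggering downstream for {event['dataset']}/{event['partition']}")

def handle_event(raw: bytes) -> None:
    """Consume a completion event safely under at-least-once delivery."""
    event = json.loads(raw)
    # dataset + partition uniquely identifies one upstream completion, so
    # redelivered duplicates collapse onto the same idempotency key.
    key = f"{event['dataset']}/{event['partition']}"
    if key in _processed:
        return  # duplicate delivery: downstream was already triggered
    enqueue_downstream(event)
    _processed.add(key)

payload = json.dumps({"dataset": "user_events", "partition": "2025-12-24",
                      "status": "SUCCESS"}).encode()
handle_event(payload)
handle_event(payload)  # duplicate delivery: ignored
```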
✓ In Practice: Many production systems use hybrid approaches. Critical paths with tight SLAs use event-driven triggering for sub-10-second latency. Less time-sensitive pipelines use polling with longer intervals to reduce load. Airbnb and LinkedIn have documented similar architectures in their engineering blogs.
The Idempotency Requirement:
Regardless of pattern, pipelines must be safe to rerun for a given partition and version. Since events can duplicate and sensors can retry, you typically write to new locations or versions, then atomically swap pointers, or use upserts with deterministic transformations. Without this, any hiccup in signaling causes double counting or corrupted aggregates.
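One way to meet this requirement, sketched below with hypothetical paths and helpers: write each attempt's output to a fresh versioned location, then atomically repoint a small pointer file, so duplicate triggers and retries can never corrupt a published partition.

```python
import json
import os
import tempfile
import uuid

def publish_partition(base: str, partition: str, rows: list[dict]) -> str:
    """Write output to a fresh versioned path, then atomically swap the pointer."""
    run_id = uuid.uuid4().hex  # new location per attempt: reruns never collide
    part_dir = os.path.join(base, partition)
    out_dir = os.path.join(part_dir, run_id)
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-00000.jsonl"), "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Atomic pointer swap: readers resolve _CURRENT and see either the old
    # version or the new one, never a half-written directory.
    tmp = tempfile.NamedTemporaryFile("w", dir=part_dir, delete=False)
    tmp.write(run_id)
    tmp.close()
    os.replace(tmp.name, os.path.join(part_dir, "_CURRENT"))  # atomic on POSIX
    return out_dir

publish_partition("/tmp/user_events", "2025-12-24", [{"user_id": 1, "event": "click"}])
```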
💡 Key Takeaways
✓ Polling queries metadata every N seconds with average detection latency of N/2, creating load on metadata stores at scale (167 queries per second for 10,000 sensors at 60 second intervals)
✓ Event-driven systems achieve p50 latency under 5 seconds by publishing completion events that immediately trigger downstream tasks, but require robust deduplication and replay logic
✓ Production systems at Netflix and Uber handle 1,000 to 10,000 dependency events per second using pub/sub infrastructure like Kafka
✓ Idempotency is mandatory: pipelines must safely rerun for the same partition and version since events duplicate and sensors retry on transient failures
✓ Hybrid approaches are common: critical paths use event-driven triggering for sub-10-second latency, while less urgent pipelines use polling to reduce infrastructure complexity
📌 Examples
1. Airflow ExternalTaskSensor polls every 60 seconds to check if an upstream DAG task succeeded, with poke_interval configurable to balance latency against scheduler load (see the sketch after this list)
2. Dagster asset sensors subscribe to S3 event notifications via SNS, triggering materialization within 2 to 5 seconds of upstream file writes with at-least-once delivery
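For the first example, a minimal configuration sketch, assuming Airflow 2.4+; the DAG and task names are hypothetical:

```python
from datetime import datetime
from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_aggregates",  # hypothetical downstream DAG
    start_date=datetime(2025, 12, 1),
    schedule="@daily",
    catchup=False,
):
    wait_for_events = ExternalTaskSensor(
        task_id="wait_for_user_events",
        external_dag_id="user_events_ingest",   # hypothetical upstream DAG
        external_task_id="validate_partition",  # hypothetical upstream task
        poke_interval=60,   # seconds between checks: latency vs scheduler load
        timeout=4 * 3600,   # give up after 4 hours instead of polling forever
        mode="reschedule",  # release the worker slot between pokes
    )
```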