
How DAG Orchestrators Execute Tasks

The Architecture: DAG orchestrators follow a control plane plus execution plane pattern. The control plane consists of a scheduler (decides what runs when), a metadata store (tracks task states and history), and a UI or API (for monitoring and manual triggers). The execution plane consists of worker processes that pull tasks from a queue and execute your code. Here's the execution flow in detail:
1. Scheduler reads DAG definitions: Every few seconds, the scheduler scans for DAGs and evaluates which tasks are ready to run based on schedule ("run daily at 1:00 AM") and dependency completion (a minimal DAG sketch follows this list).
2. Tasks enqueued: Ready tasks are placed in a task queue (often Redis or RabbitMQ). At moderate scale, a single scheduler node can enqueue 1,000 to 3,000 tasks per minute with subsecond latency.
3. Workers execute: Worker processes pull tasks, run your code, and send heartbeats every 10 to 30 seconds to prove they're alive.
4. State tracking: Every state change (queued, running, success, failed, retry) is written to the metadata database, creating an audit trail.
The Metadata Store Under Load: This database is on the critical path. Consider a large deployment running 100,000 tasks per day. Each task changes state 5 to 10 times (scheduled, queued, running, success/failed, plus potential retries). That's 500,000 to 1,000,000 writes per day, or roughly 6 to 12 writes per second average, with peak loads 3x to 5x higher during batch windows. To hit 99.9% availability, teams typically run the metadata store on a replicated PostgreSQL or MySQL instance with synchronous replication within a region. The write amplification matters: if your orchestrator manages 50 active DAGs with 2,000 total tasks, and you query task history frequently in the UI, you need to index on dag_id, task_id, execution_date, and state to keep query latency under 100ms.
Scheduler throughput: roughly 3,000 tasks per minute enqueued, with sub-second enqueue time.
Concurrency Controls: Orchestrators enforce limits at multiple levels. At the DAG level, you might set max_active_runs=1 to ensure only one instance of a pipeline runs at a time, preventing data races when multiple runs write to the same tables. At the task level, you define pools that limit concurrent external API calls to respect rate limits (for example, max 100 concurrent tasks calling a third party API limited to 100 queries per second).
💡 Key Takeaways
The scheduler evaluates DAG readiness every few seconds and can enqueue 1,000 to 3,000 tasks per minute on a single node
Metadata store handles 6 to 12 writes per second on average for 100,000 daily tasks, with 3x to 5x peaks during batch windows requiring indexed queries under 100ms
Workers send heartbeats every 10 to 30 seconds; missed heartbeats trigger automatic task rescheduling to handle worker crashes
Concurrency controls prevent resource exhaustion and rate limit violations through DAG level max runs and task level pool limits
📌 Examples
1. Airflow deployment at Airbnb: tens of thousands of tasks per day coordinated through a central metadata database, with distributed Celery executors for horizontal worker scaling.
2. Task pool limiting external API calls: setting the pool size to 100 ensures your pipeline respects a third-party rate limit of 100 QPS even when 500 tasks are ready to run.