What is DAG-based Orchestration?
Definition
DAG-based orchestration coordinates interdependent data tasks by modeling them as a Directed Acyclic Graph (DAG): nodes represent tasks and directed edges represent dependencies. Each task runs only after the tasks it depends on have completed, and because the graph has no cycles, no task can depend on itself, directly or indirectly.
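To make the definition concrete, here is a minimal sketch that uses Python's standard-library `graphlib` module (Python 3.9+) rather than a real orchestrator. The task names and the `execution_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)  # ['extract', 'transform', 'load']
```

An orchestrator such as Airflow or Prefect layers scheduling, retries, and monitoring on top of this ordering idea; the sketch shows only the ordering.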
💡 Key Takeaways
✓ A DAG (Directed Acyclic Graph) models a workflow as nodes (tasks) and edges (dependencies). Because it has no circular dependencies, a valid execution order always exists and no task can end up waiting on itself
✓ Orchestrators separate task logic from execution mechanics: you write what to do, and the system handles when and where it runs, plus retries and monitoring
✓ A typical retry policy allows 3 to 5 retries with exponential backoff (for example, waiting 1 min, then 2 min, then 4 min) before marking the task as failed
✓ The acyclic property is checked when the DAG is defined, so a workflow with circular dependencies is rejected before it can be deployed
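The last two takeaways can be sketched in plain Python. This is an illustration under stated assumptions, not how any particular orchestrator implements them: `validate_dag`, `backoff_delays`, and `run_with_retries` are hypothetical helpers, and the defaults (3 retries, 60-second base delay) simply mirror the 1 min, 2 min, 4 min example.

```python
import time
from graphlib import CycleError, TopologicalSorter

def validate_dag(deps):
    """Reject a dependency mapping that contains a cycle (a definition-time check)."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"cyclic dependency: {exc.args[1]}") from exc

def backoff_delays(max_retries=3, base_seconds=60):
    """Seconds to wait before each retry: base, 2 * base, 4 * base, ..."""
    return [base_seconds * 2 ** attempt for attempt in range(max_retries)]

def run_with_retries(task, max_retries=3, base_seconds=60, sleep=time.sleep):
    """Call task(); on failure, wait with exponential backoff and try again.

    Re-raises the last exception once all retries are used up, which
    corresponds to marking the task as failed.
    """
    delays = backoff_delays(max_retries, base_seconds)
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(delays[attempt])
```

Passing `sleep` as a parameter is a design choice that lets the retry logic be exercised in tests without actually waiting; `backoff_delays(3, 60)` yields delays of 60, 120, and 240 seconds.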
📌 Interview Tips
1. Daily ETL pipeline: Ingest customer data from 3 sources in parallel (10 minutes each), transform and join the data (30 minutes), then publish to the data warehouse and update BI dashboards
2. ML training workflow: Extract features from event logs, preprocess data in parallel for 5 models, train each model concurrently, evaluate the results, and deploy the best-performing model
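The daily ETL example can be written out as a dependency graph to show why parallel ingestion matters. This sketch again uses the standard-library `graphlib`. The task names are hypothetical, running the dashboard update after the warehouse publish is an assumption, and only the durations stated in the example (10 minutes per ingest, 30 minutes for the transform) are used; the publish and dashboard steps have no stated duration, so they are left out of the timing.

```python
from graphlib import TopologicalSorter

# Task -> upstream tasks. Names are hypothetical labels for the ETL steps.
etl = {
    "ingest_a": set(),
    "ingest_b": set(),
    "ingest_c": set(),
    "transform_join": {"ingest_a", "ingest_b", "ingest_c"},
    "publish_warehouse": {"transform_join"},
    "update_dashboards": {"publish_warehouse"},  # assumed ordering
}

# Minutes, taken from the example; later steps have no stated duration.
durations = {"ingest_a": 10, "ingest_b": 10, "ingest_c": 10, "transform_join": 30}

def parallel_waves(deps):
    """Group tasks into waves; tasks within a wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

def finish_time(task, deps, minutes):
    """Earliest finish of `task` if every upstream task starts as soon as it can."""
    start = max((finish_time(up, deps, minutes) for up in deps[task]), default=0)
    return start + minutes[task]

waves = parallel_waves(etl)
parallel_minutes = finish_time("transform_join", etl, durations)  # 10 + 30
serial_minutes = sum(durations.values())                          # 3 * 10 + 30

print(waves)
print(parallel_minutes, serial_minutes)
```

Under these assumptions, the three ingests form the first wave, so the transform finishes 40 minutes after the run starts instead of the 60 minutes a strictly sequential run would need.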