
Orchestration, Backfills, and Failure Handling: DAGs, Retries, and Priority Queues

Orchestration coordinates data pipeline workflows as Directed Acyclic Graphs (DAGs) with explicit dependencies, retries, and failure handling. At scale, the challenge is managing thousands of concurrent tasks and backfill storms while ensuring fresh data takes priority over historical reprocessing.

Model each pipeline as a DAG where nodes are tasks (extract, transform, load) and edges represent data dependencies. Tasks are idempotent and checkpointed: they record source offsets, watermark state, and versioned input datasets so retries do not cause duplicate writes or skip data. Large organizations like Airbnb publicly report tens of thousands of daily task runs and thousands of concurrent tasks, requiring robust scheduling, priority queues, and resource quotas.

Backfills are a common failure mode. Reprocessing months of historical data can overwhelm real-time pipelines, causing cascading lag. Separate backfill jobs into a lower-priority service class with concurrency limits and throttled write rates. For example, Amazon teams throttle backfills to 100 to 200 megabytes per second per table while reserving higher throughput for fresh data to protect customer-facing Service Level Agreements (SLAs). Implement a recompute-window control that bounds how far back regular jobs scan (e.g., a rolling 48 hours), with periodic deep compactions running offline.

Retries need exponential backoff and jitter to avoid thundering herds when upstream failures recover. Implement circuit breakers that trip after N consecutive failures and half-open after a cooldown period.

Monitor DAG completion time, task failure rates, and queue depth. Alert on anomalies like sudden spikes in retry rates or long-tail tasks blocking downstream dependencies. Use lineage tracking to record which upstream dataset versions produced which downstream artifacts, enabling root-cause analysis and targeted reruns when bugs are discovered.
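The retry policy above can be sketched as a full-jitter exponential backoff schedule. This is a minimal sketch, not tied to any particular orchestrator; the function name and defaults are illustrative:

```python
import random

def backoff_delays(base=1.0, cap=300.0, attempts=5, rng=random.random):
    """Full-jitter exponential backoff: the i-th delay is drawn uniformly
    from [0, min(cap, base * 2**i)) seconds. The random jitter spreads
    retries out so a recovering upstream is not hit by a thundering herd;
    the cap bounds the worst-case wait."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]
```

With jitter disabled (rng returning 1.0) the schedule is the familiar 1 s, 2 s, 4 s, 8 s, 16 s doubling, truncated at the cap.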
💡 Key Takeaways
Model pipelines as Directed Acyclic Graphs (DAGs) with idempotent, checkpointed tasks recording source offsets and watermark state for safe retries.
Large enterprises run tens of thousands of daily task runs and thousands of concurrent tasks (per Airbnb's public reports), requiring priority queues, resource quotas, and scheduling coordination.
Separate backfills into lower priority classes with throttled write rates (e.g., 100 to 200 MB/s per table) to avoid overwhelming real-time pipelines during large reprocessing jobs.
Implement exponential backoff, jitter, and circuit breakers for retries. Monitor DAG completion time, task failure rates, and queue depth; alert on anomalies.
Use lineage tracking to record dataset versions and dependencies, enabling root cause analysis and targeted reruns when bugs or data quality issues are discovered.
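One lightweight way to implement the lineage tracking in the last takeaway is an append-only store mapping each output artifact to the (dataset, version) pairs that produced it. This is a sketch; all class and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageRecord:
    output: str    # downstream artifact, e.g. "daily_revenue/2024-06-01"
    inputs: tuple  # (dataset, version) pairs consumed to build it
    task: str      # the task that produced the artifact

class LineageStore:
    def __init__(self):
        self._records = []

    def record(self, rec: LineageRecord) -> None:
        self._records.append(rec)

    def impacted_by(self, dataset: str, version: str) -> list:
        """Return downstream artifacts built from a given input version,
        so a bug fix can trigger targeted reruns instead of a full backfill."""
        return [r.output for r in self._records
                if (dataset, version) in r.inputs]
```

When a bug is found in, say, version v3 of an upstream dataset, `impacted_by` yields exactly the artifacts to rerun rather than forcing a blanket reprocess.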
📌 Examples
Amazon backfill throttling: during a 6-month reprocess, limit writes to 150 MB/s per table while fresh micro-batches use up to 500 MB/s, protecting customer-facing dashboard SLAs.
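The per-table throttling in this example can be enforced with a token bucket per service class (one small bucket for backfill writers, a larger one for fresh data). A minimal single-threaded sketch, using the 150 MB/s figure purely as an illustrative rate:

```python
import time

class WriteThrottle:
    """Token-bucket write throttle. Tokens (bytes) refill continuously at
    `rate_bytes` per second, capped at one second of burst; a write may
    proceed only if enough tokens are available. The clock is injectable
    for testing. Class name and API are illustrative, not a real library."""
    def __init__(self, rate_bytes, clock=time.monotonic):
        self.rate = rate_bytes
        self.capacity = rate_bytes   # allow at most one second of burst
        self.tokens = rate_bytes
        self.last = clock()
        self.clock = clock

    def try_acquire(self, nbytes):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should back off or queue the write
```

A backfill worker would call `try_acquire` before each batch write and sleep when refused, while fresh-data writers use a separate bucket with a higher rate.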
Circuit breaker: after 5 consecutive task failures, trip to OPEN state for 10 minutes. After cooldown, enter HALF_OPEN and attempt one retry. On success, return to CLOSED; on failure, return to OPEN for exponentially longer cooldown.
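The CLOSED/OPEN/HALF_OPEN state machine in this example is small enough to sketch directly. This version injects the clock so the cooldown is testable; the class name and API are illustrative:

```python
import time

class CircuitBreaker:
    """Trips OPEN after `threshold` consecutive failures. After `cooldown`
    seconds it admits one probe in HALF_OPEN: success returns it to CLOSED
    and resets the cooldown; failure reopens it with an exponentially
    longer cooldown, as described in the example above."""
    def __init__(self, threshold=5, cooldown=600.0, clock=time.monotonic):
        self.threshold = threshold
        self.base_cooldown = cooldown
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def allow(self):
        """Should the next task attempt proceed?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"   # admit a single probe
                return True
            return False
        return True  # CLOSED, or HALF_OPEN probe already admitted

    def on_success(self):
        self.failures = 0
        self.state = "CLOSED"
        self.cooldown = self.base_cooldown

    def on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            if self.state == "HALF_OPEN":
                self.cooldown *= 2       # exponentially longer cooldown
            self.state = "OPEN"
            self.opened_at = self.clock()
            self.failures = 0
```

With the example's numbers (threshold 5, cooldown 600 s), five straight failures open the circuit, a probe is admitted after 10 minutes, and a failed probe doubles the next cooldown to 20 minutes.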