Failure Modes and Edge Cases in dbt Workflows
Silent Data Corruption:
The most dangerous failure in a dbt workflow is when models run successfully from an infrastructure perspective but produce wrong results. The scheduler reports success, weak tests still pass, and incorrect data propagates downstream. In a system with 500 models, a single wrong aggregation in a core intermediate model can corrupt hundreds of dashboards and dozens of downstream models.
Real-world example: an incorrect revenue calculation that double-counts refunds runs for 6 hours before detection. During that time, executives make decisions on inflated revenue metrics, customer success teams use wrong churn numbers, and finance reports incorrect figures. The blast radius extends to every model and dashboard referencing that core revenue logic.
Mitigation combines data tests (see the callout below) with operational monitoring: a bad join that fans out rows usually shows up as a runtime and cost anomaly before anyone notices the numbers are wrong. Monitor runtime distributions and alert when p95 runtime for any model exceeds its baseline by 2x. Catch regressions in staging before they hit prod. Some teams also run cost estimation in CI, failing builds if projected query costs exceed thresholds.
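A minimal monitoring sketch, assuming Snowflake-flavored SQL and a hypothetical audit table analytics.dbt_model_runs (model_name, run_started_at, execution_seconds) populated from dbt's run_results.json after each production run; it surfaces any model whose latest runtime exceeded 2x its trailing 30-day p95:

```sql
-- Hypothetical audit table: analytics.dbt_model_runs(model_name, run_started_at, execution_seconds),
-- assumed to be loaded from run_results.json after every production dbt run.
with baseline as (
    -- Trailing 30-day p95 per model, excluding the most recent day so the
    -- run being checked does not inflate its own baseline.
    select
        model_name,
        approx_percentile(execution_seconds, 0.95) as p95_seconds
    from analytics.dbt_model_runs
    where run_started_at >= dateadd('day', -30, current_timestamp())
      and run_started_at <  dateadd('day', -1,  current_timestamp())
    group by model_name
),

latest as (
    -- Most recent run per model.
    select model_name, execution_seconds
    from analytics.dbt_model_runs
    qualify row_number() over (partition by model_name order by run_started_at desc) = 1
)

select
    l.model_name,
    l.execution_seconds as latest_seconds,
    b.p95_seconds       as baseline_p95_seconds
from latest l
join baseline b using (model_name)
where l.execution_seconds > 2 * b.p95_seconds   -- alert threshold from the text: 2x baseline
order by l.execution_seconds desc
```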
Backfill and Partial Run Challenges:
When you need to recompute months of historical data after a logic fix, naive full refreshes can scan tens of terabytes and cost thousands of dollars. For a fact table with 10 billion rows spanning 2 years, a full rebuild might take 8 hours and scan 5 TB per run. Incremental models help, but backfilling requires careful partition-aware logic to avoid reprocessing everything.
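A sketch of the partition-aware pattern, assuming a BigQuery-style adapter and a hypothetical fct_events model partitioned by event_date; routine runs reprocess only recent partitions, while backfills pass an explicit window via --vars so each run touches one slice of history at a time:

```sql
-- models/marts/fct_events.sql (hypothetical model; assumes a BigQuery-style adapter
-- where insert_overwrite replaces whole date partitions instead of merging rows)
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

select
    event_id,
    user_id,
    event_date,
    revenue_amount
from {{ ref('stg_events') }}

{% if is_incremental() %}
  {% if var('start_date', none) and var('end_date', none) %}
    -- Backfill mode, one month per invocation:
    --   dbt run -s fct_events --vars '{start_date: "2023-01-01", end_date: "2023-02-01"}'
    where event_date >= '{{ var("start_date") }}'
      and event_date <  '{{ var("end_date") }}'
  {% else %}
    -- Routine run: only reprocess the trailing few days of partitions.
    where event_date >= date_sub(current_date(), interval 3 day)
  {% endif %}
{% endif %}
```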
Concurrency is another edge case. Running multiple overlapping dbt jobs that write to the same tables creates race conditions. On warehouses with strict concurrency limits, you might hit lock contention or serialization errors. Teams solve this by time-boxing runs (only one prod run per hourly window) or by giving concurrent dev work separate target schemas.
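For the separate-schemas approach, one common option is overriding dbt's built-in generate_schema_name macro; the sketch below assumes prod is identified by target name and each developer's profile sets its own target schema (e.g. dbt_alice):

```sql
-- macros/generate_schema_name.sql
-- Prod runs write to the configured custom schemas; every other target prefixes the
-- developer's schema (e.g. dbt_alice_staging), so overlapping runs never collide.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- elif custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```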
❗ Remember: Tests are not optional at scale. Every model touching money, user counts, or critical business metrics needs uniqueness tests, non-null checks, and business logic assertions. Treat failing tests as deployment blockers.
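Uniqueness and not-null checks are usually declared in YAML, but business-logic assertions fit naturally as singular tests: SQL files under tests/ that fail whenever they return rows. A sketch with hypothetical model and column names, aimed at the double-counted-refund scenario above:

```sql
-- tests/assert_net_revenue_reconciles.sql (hypothetical model and column names)
-- Fails if any order's net revenue differs from gross minus refunds by more than a cent,
-- which is exactly how a double-counted refund would surface.
select
    order_id,
    gross_revenue,
    refund_amount,
    net_revenue
from {{ ref('fct_revenue') }}
where abs(net_revenue - (gross_revenue - refund_amount)) > 0.01
```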
Schema Drift from Sources:
Upstream ingestion pipelines evolve independently. When a source system adds, renames, or removes columns, staging models may fail or, worse, silently produce partial results. The trade-off is between strict contracts and resilience. Failing fast on any schema mismatch improves correctness but reduces availability. If your ingestion adds a new column, should dbt models fail or pass it through?
At 10x scale with dozens of source systems, you must have explicit conventions. Some teams version their staging models and run schema change detection in CI. Others use permissive patterns that allow new columns but alert on removals. The choice depends on how often schemas change and how critical freshness is.
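One permissive-but-explicit middle ground is selecting staging columns by name: a removed or renamed upstream column breaks the model loudly at run time, while newly added columns are simply ignored until someone opts in. A sketch assuming a hypothetical, already-declared crm.users source (teams on the strict end would add an enforced model contract on top):

```sql
-- models/staging/stg_crm__users.sql (hypothetical source and column names)
-- Explicit column selection: a dropped or renamed upstream column fails this model
-- immediately instead of letting a select * silently change what flows downstream.
with source as (
    select * from {{ source('crm', 'users') }}
)

select
    id         as user_id,
    email,
    segment    as user_segment,
    created_at
from source
```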
Performance Degradation and Cost Spikes:
A model that normally scans 100 GB and completes in 2 minutes starts scanning 1 TB after a join change drops a partition filter. Runtime jumps to 20 minutes, warehouse costs spike by 10x, and downstream models miss freshness SLOs. The damage cascades: models that depend on the slow one queue up, and a pipeline that normally finishes in 15 minutes at p95 now takes 45.
[Diagram] Performance regression impact: NORMAL (2 min) → BAD FILTER (20 min) → PIPELINE (45 min)
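To catch the scan-size side of this regression early, some teams diff bytes scanned per model against the prior week. A sketch assuming Snowflake, with dbt-snowflake's query_tag config set to each model's name (note that account_usage.query_history lags by up to ~45 minutes):

```sql
-- Assumes Snowflake and that dbt models set query_tag to the model name
-- (dbt-snowflake's query_tag config); flags models scanning >3x more than a week ago.
with daily as (
    select
        query_tag                           as model_name,
        to_date(start_time)                 as run_day,
        sum(bytes_scanned) / power(1024, 3) as gb_scanned
    from snowflake.account_usage.query_history
    where query_tag != ''
      and start_time >= dateadd('day', -8, current_timestamp())
    group by 1, 2
)

select
    today.model_name,
    today.gb_scanned     as gb_scanned_today,
    last_week.gb_scanned as gb_scanned_last_week
from daily as today
join daily as last_week
  on today.model_name = last_week.model_name
 and last_week.run_day = dateadd('day', -7, today.run_day)
where today.run_day = current_date()
  and today.gb_scanned > 3 * last_week.gb_scanned
```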
💡 Key Takeaways
✓ Silent data corruption is the highest-risk failure: models run successfully but produce wrong results, propagating errors to hundreds of downstream models and dashboards for hours
✓ Schema drift from upstream sources creates a trade-off between strict contracts (fail fast, higher correctness) and permissive patterns (resilient, but risk partial results)
✓ Performance regressions where a model's runtime jumps from 2 minutes to 20 minutes cascade through the DAG, causing the entire pipeline to miss freshness SLOs by 3x
✓ Backfilling historical data after logic changes requires partition-aware strategies; naive full refreshes can scan tens of terabytes, costing thousands of dollars and taking 8+ hours
✓ Concurrency issues arise when multiple overlapping dbt jobs write to the same tables, causing race conditions or lock contention on warehouses with strict limits
📌 Examples
1. A core revenue model double-counts refunds due to a join logic bug. It runs for 6 hours, corrupting 150 downstream models and 200 dashboards before detection. Finance reports inflated quarterly numbers, requiring a public correction.
2. An upstream API adds a new user_segment column without notice. Staging models fail with schema-mismatch errors, halting the entire pipeline. The team chooses to fail fast rather than risk partial data, accepting 30 minutes of downtime while they deploy a fix.
3. After fixing calculation logic, a team backfills 18 months of data. The naive approach would scan 8 TB per run, running 2 hours a day for a week. Instead, they use incremental backfills with start_date and end_date filters, processing one month at a time and completing in 3 days at 10x lower cost.