ETL/ELT Patterns • dbt Transformation Workflow • Medium • ⏱️ ~3 min
How dbt Model Dependencies and Execution Work
The Model DAG Concept:
Every dbt model is a SQL SELECT statement that produces a table or view. The magic happens when models reference other models using the ref() function. When you write SELECT * FROM {{ ref('staging_orders') }}, you declare an explicit dependency. dbt compiles these references into a directed acyclic graph representing the entire transformation pipeline.
This graph determines execution order. If Model C references Models A and B, dbt guarantees A and B complete before C starts. The system can parallelize independent branches. With 300 models, you might have 8 to 16 running concurrently at any time, limited by warehouse capacity and concurrency settings.
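For illustration, here is a minimal sketch of a downstream model wired with ref(). The model and column names (staging_orders, staging_customers, orders_enriched) are hypothetical, not from the source:

```sql
-- models/marts/orders_enriched.sql (hypothetical model)
-- Each ref() call declares a DAG edge: dbt guarantees both staging
-- models are built before this one, and resolves ref() to the
-- correct schema-qualified relation at compile time.
select
    o.order_id,
    o.ordered_at,
    c.customer_segment
from {{ ref('staging_orders') }} as o
join {{ ref('staging_customers') }} as c
    on o.customer_id = c.customer_id
```

Because the dependency lives in the SQL itself, renaming or moving a model never breaks downstream consumers: dbt re-resolves every ref() on the next compile.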
Materialization Strategies:
Each model chooses how it materializes in the warehouse. Views are lightweight, computed on each query, perfect for simple transforms or rarely accessed models. Tables store results physically, trading storage for query speed. A fact table with 10 billion rows might take 2 hours to rebuild as a full table refresh, but queries run in seconds instead of minutes.
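A model opts into a materialization through dbt's config() block. The following is a sketch with hypothetical model and column names; swap materialized='view' in for lightweight transforms:

```sql
-- models/marts/fct_daily_revenue.sql (hypothetical model)
-- 'table' pays the build cost once per run so dashboard queries
-- hit a physical table instead of recomputing the aggregation.
{{ config(materialized='table') }}

select
    order_date,
    sum(amount) as daily_revenue
from {{ ref('staging_orders') }}
group by order_date
```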
Incremental models are where it gets interesting. Instead of rebuilding everything, they append or overwrite only new data. For a table growing by 50 million rows daily, an incremental run scans only that day's partition and completes in 5 to 10 minutes instead of hours. The trade-off is complexity: you manage merge logic and partition boundaries.
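A hedged sketch of the incremental pattern, using dbt's is_incremental() flag and the {{ this }} relation (the model names, merge strategy, and timestamp filter are illustrative assumptions, not the source's pipeline):

```sql
-- models/marts/fct_events.sql (hypothetical model)
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

select
    event_id,
    event_timestamp,
    user_id
from {{ ref('staging_events') }}

{% if is_incremental() %}
  -- Only on incremental runs: scan rows newer than what the target
  -- already holds. {{ this }} refers to the existing target table.
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

On the first run (or with --full-refresh) the if-block is skipped and the whole table is built; afterward each run merges just the new slice on unique_key, which is exactly the merge logic and partition-boundary management the trade-off refers to.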
Real Execution Example:
Consider a pipeline with three layers. First, 20 staging models clean raw sources, each taking 30 to 90 seconds. These run in parallel since they have no interdependencies. Next, 50 intermediate models apply business logic, depending on staging outputs. These execute in waves as dependencies resolve, with 12 to 16 running concurrently. Finally, 10 mart models aggregate for dashboards, each scanning billions of rows and taking 3 to 8 minutes.
Total wall time for this 80-model pipeline might be 12 minutes at p50. Serial execution would take over 2 hours. The DAG structure and parallelization make this feasible for hourly or even 15-minute refresh schedules.
Execution Parallelism Impact: ~45 min serial vs ~8 min with 16-way parallel execution.
💡 Key Takeaways
✓The <code>ref()</code> function creates explicit dependencies between models, which dbt compiles into a directed acyclic graph that determines execution order and enables parallelization
✓Materialization choices (view, table, incremental) trade query speed against build time and storage, with incremental models reducing daily refreshes from hours to minutes for large tables
✓Parallel execution with 8 to 16 concurrent models reduces total pipeline runtime from potential hours to 10 to 15 minutes, enabling hourly or sub-hourly refresh schedules
✓For a 10 billion row fact table growing by 50 million rows daily, incremental materialization scans only new partitions, completing in 5 to 10 minutes versus 2 hours for full refresh
✓The DAG enforces dependencies automatically: if Model C needs A and B, dbt guarantees A and B finish first, eliminating manual orchestration and race conditions
📌 Examples
1. A staging model <code>staging_events</code> cleans raw clickstream data in 60 seconds. Ten intermediate models reference it via <code>{{ ref('staging_events') }}</code>, each calculating session metrics. These run in parallel once staging completes, finishing the entire branch in 90 seconds instead of 15 minutes serially.
2. An incremental fact table appends only today's partition using <code>is_incremental()</code> logic. On Monday morning, it processes Sunday's 50 million events in 8 minutes. Without incremental logic, scanning all 10 billion rows would take 2 hours and cost 10x more in warehouse compute.