Dimensional Modeling: Star and Snowflake Schemas

Dimensional modeling organizes data for analytical queries by separating measurements (facts) from descriptive context (dimensions). A fact table contains quantitative metrics at a specific grain, like individual order lines or page views, with foreign keys pointing to dimension tables that provide rich descriptive attributes. This structure makes business questions natural: slice revenue facts by customer dimension, dice by product and time dimensions, and drill down from category to SKU.

Star schema denormalizes dimensions into single wide tables. The customer dimension includes all attributes from name and email to city, state, and country in one table. Queries join the fact table to typically 3 to 8 dimension tables in a simple star pattern, and query optimizers handle these predictable joins efficiently. Amazon Redshift queries scanning 100 to 500 GB complete in 5 to 30 seconds on mid sized clusters when star schemas align with distribution keys. The trade off is storage: dimension attributes are duplicated across multiple dimension records when using Slowly Changing Dimension (SCD) Type 2 to track history.

Snowflake schema normalizes dimensions into related tables. Customer dimension splits into customer, city, state, and country tables linked by foreign keys, saving storage by eliminating duplication. However, queries now require more joins: fact to customer to city to state to country. Each additional join introduces risk of poor join strategy selection, data skew on join keys, and sensitivity to statistics freshness. In practice, the storage savings rarely justify the performance complexity for analytical workloads on modern columnar storage with aggressive compression.

Dimension surrogate keys are critical. Instead of joining facts to dimensions on natural business keys like customer email or product SKU (which can change), assign each dimension row a synthetic integer surrogate key. Facts reference these stable surrogates, making joins faster (integer equality) and enabling SCD Type 2 history tracking where multiple surrogate keys can represent the same business entity across time.

💡 Key Takeaways

✓Fact tables store measurements at a clearly defined grain (order line, page view, transaction) with foreign keys to dimensions and numeric metrics like revenue, quantity, duration

✓Star schema denormalizes dimensions into wide tables enabling 3 to 8 simple joins, completing queries scanning 100 to 500 GB in 5 to 30 seconds on optimized Amazon Redshift clusters

✓Snowflake schema normalizes dimensions to save storage but adds join complexity and sensitivity to query planner statistics, rarely justified given columnar compression already achieves 5 to 10x size reduction

✓Surrogate keys (synthetic integers) assigned to each dimension row enable faster integer joins and Slowly Changing Dimension (SCD) Type 2 history where multiple surrogates track one business entity over time

✓Wrong join strategy on star schema (broadcasting a large dimension instead of hash partitioning) can explode memory usage and spill to disk, pushing query latency from seconds to minutes

✓Conformed dimensions shared across multiple fact tables (customer, product, date with identical surrogate keys) enable drill across queries that combine metrics from different business processes

📌 Interview Tips

1Sales fact table with grain of one row per order line: columns include order_line_id, order_date_key (FK to date dimension), customer_key (FK to customer dimension), product_key (FK to product dimension), quantity, unit_price, revenue

2Customer dimension with surrogate key customer_key, business key customer_id, and SCD Type 2 tracking: customer_key 1001 and 1002 both reference customer_id ABC123, representing before and after address change with effective_from and effective_to dates

3Snowflake normalized customer dimension split into customer (customer_key, name, city_key FK), city (city_key, city_name, state_key FK), state (state_key, state_name, country_key FK), requiring 4 joins instead of 1

← Back to Data Warehousing Architecture Overview