Definition
Normalization organizes data into minimal, logically independent tables linked by foreign keys so each fact is stored exactly once. Denormalization deliberately duplicates data or precomputes results to eliminate joins at read time.
Normalized Design
Normalization splits data into separate tables (users, posts, comments), each with its own primary key. When displaying a post with author info, you join tables at query time. The advantage is write efficiency: updating a username touches one row in the users table, and all queries automatically reflect the change. Storage is minimal because nothing is duplicated. A normalized schema for 250 million users with posts and relationships might total 20 TB.
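A minimal sketch of the normalized pattern, using an in-memory SQLite database (the table and column names are illustrative, not taken from any particular system):

```python
import sqlite3

# In-memory SQLite database; schema names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),
                        text TEXT);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO posts VALUES (10, 1, 'hello world')")

# Rendering a post requires a join at query time.
row = conn.execute("""
    SELECT u.username, p.text
    FROM posts p JOIN users u ON u.id = p.user_id
    WHERE p.id = 10
""").fetchone()
print(row)  # ('alice', 'hello world')

# Updating the username touches exactly one row; every future
# join reflects the change automatically, with no duplication.
conn.execute("UPDATE users SET username = 'alice_k' WHERE id = 1")
```

The rename is a single-row write, which is the whole point of the normalized shape.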
Denormalized Design
Denormalization embeds or duplicates the data a query needs directly in the row it reads. Instead of joining users + posts + comments to render a feed item, you store author_name, post_text, and comment_count together in a precomputed feed row. Reads become single-row lookups returning in 1-5 ms from cache. The cost: every source change must propagate to all denormalized copies. If a user has 300 followers, one post creates 300 feed row writes.
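The fan-out cost can be sketched with hypothetical in-memory dicts standing in for the follower graph and the per-user feed store:

```python
# Hypothetical in-memory stores standing in for the follower graph
# and the denormalized per-follower feed rows.
followers = {"alice": ["bob", "carol", "dave"]}  # alice has 3 followers
feeds = {}  # follower -> list of precomputed feed rows

def publish_post(author, text, comment_count=0):
    """One logical write fans out to one feed row per follower."""
    row = {"author_name": author, "post_text": text,
           "comment_count": comment_count}
    writes = 0
    for follower in followers.get(author, []):
        feeds.setdefault(follower, []).append(dict(row))
        writes += 1
    return writes  # write amplification equals the fan-out

writes = publish_post("alice", "hello world")
print(writes)        # 3 writes for 3 followers
print(feeds["bob"])  # a single-row lookup; no join needed at read time
```

With 300 followers the same call would perform 300 writes, which is the amplification the text describes.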
The Fundamental Trade-off
The core trade-off is read latency versus write complexity. Normalized models keep writes cheap (typically < 5 ms for a single row) but reads may require multiple joins. If each cross-shard lookup adds 5-10 ms, five joins push your p50 latency to 25-50 ms and p99 above 200 ms under load. Denormalized reads hit < 10 ms consistently but writes fan out proportionally to relationship count.
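The arithmetic behind those figures is simple enough to spell out (the per-hop and fan-out numbers are the ones assumed above, not measurements):

```python
# Back-of-envelope trade-off using the figures assumed in the text.
cross_shard_hop_ms = (5, 10)   # assumed cost range per cross-shard join
joins = 5

norm_read_ms = (joins * cross_shard_hop_ms[0],
                joins * cross_shard_hop_ms[1])
denorm_read_ms = 10            # single-row lookup upper bound
fan_out = 300                  # followers receiving each post

print(f"normalized read: {norm_read_ms[0]}-{norm_read_ms[1]} ms over {joins} joins")
print(f"denormalized read: <{denorm_read_ms} ms, but each post costs {fan_out} writes")
```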
Hybrid Reality
Most large systems use both. The normalized store is the source of truth for correctness. Denormalized projections serve reads and are regenerated from the source via change streams (real-time feeds of database changes). The normalized data guarantees consistency; the denormalized views provide performance. This separation lets you optimize each path independently.
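A minimal sketch of the hybrid pattern, with a hypothetical list of change events standing in for the change stream (a real CDC pipeline, e.g. Debezium over Kafka, carries full before/after images and offsets):

```python
# Normalized source of truth (hypothetical minimal tables).
users = {1: {"username": "alice"}}
posts = {10: {"user_id": 1, "text": "hello"}}

# Denormalized projection: post_id -> feed row with the author name embedded.
feed_view = {}

def apply_change(event):
    """Consume one change-stream event and update the projection.

    Only two event types are sketched here; the projection is
    eventually consistent with the normalized source.
    """
    if event["type"] == "post_created":
        post = posts[event["post_id"]]
        feed_view[event["post_id"]] = {
            "author_name": users[post["user_id"]]["username"],
            "post_text": post["text"],
        }
    elif event["type"] == "user_renamed":
        users[event["user_id"]]["username"] = event["new_name"]
        # Propagate the change to every derived row that embeds it.
        for post_id, row in feed_view.items():
            if posts[post_id]["user_id"] == event["user_id"]:
                row["author_name"] = event["new_name"]

for ev in [{"type": "post_created", "post_id": 10},
           {"type": "user_renamed", "user_id": 1, "new_name": "alice_k"}]:
    apply_change(ev)

print(feed_view[10]["author_name"])  # alice_k
```

The read path never touches the normalized tables; the write path owns the propagation, which is exactly the separation the hybrid design buys.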
✓ Normalization stores each fact once, optimizing for write efficiency (a single ~5 ms update) and correctness, but reads that join across shards can add 5-10 ms per hop, pushing p99 latency above 200 ms
✓ Denormalization duplicates data along access paths to eliminate joins, achieving single-digit-millisecond read latency from cache, but write amplification scales with fan-out (300 followers means 300 writes per post)
✓ Storage cost multiplier for denormalization is typically 3-10x the normalized size once you include replicas and indexes; at $20-50 per TB per month for SSD, a 50 TB denormalized store costs $1,000-2,500 monthly per replica
✓ Production systems treat normalized data as the source of truth and denormalized projections as derived products regenerated via change data capture streams with eventual consistency
✓ The decision hinges on measured access patterns: normalize for write-heavy workloads with strict invariants (financial transactions), denormalize for read-heavy endpoints with tight latency SLOs (feeds serving 90%+ reads with p99 under 1 second)
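The storage cost figure in the takeaways above reduces to simple arithmetic:

```python
# Hedged back-of-envelope using the ranges quoted above.
cost_per_tb_month = (20, 50)  # SSD cost range, $/TB/month
denorm_tb = 50                # the 50 TB denormalized store in the example

monthly = (denorm_tb * cost_per_tb_month[0],
           denorm_tb * cost_per_tb_month[1])
print(f"{denorm_tb} TB costs ${monthly[0]:,}-${monthly[1]:,} per month per replica")
# 50 TB * $20-50/TB/mo = $1,000-2,500 per replica per month
```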
1. Meta social graph: normalized users, posts, and edges as the source of truth; denormalized per-user feed rows with embedded author name, post snippet, and ranking features serve 250 million users with 500 items each (25 TB per replica) at a p50 of 100-200 ms
2. Pinterest homefeed: normalized pin, board, and user entities; a denormalized per-user homefeed index with precomputed ranking features serves 400 million monthly active users and 72 billion daily reads at a 98% cache hit rate, keeping origin queries to roughly 17,000 per second globally (2% of 72 billion reads spread over a day)