Database DesignIndexing StrategiesMedium⏱️ ~2 min

Why Do Composite Index Column Ordering and Selectivity Matter?

Composite indexes combine multiple columns into a single index structure, with column order determining which queries can leverage the index. B+ tree indexes support leftmost prefix matching: an index on (A, B, C) can serve queries filtering on A, (A, B), or (A, B, C), but not on B alone or C alone. The leading columns must be present in the query predicate for the index to be useful. This means column order directly controls query coverage and performance. Selectivity drives ordering decisions. Place the most selective column first to minimize the number of rows scanned. If column A filters to 1 percent of rows and column B filters to 50 percent, leading with A means the index scan touches 1 percent of leaf pages, then narrows further with B. Reversing the order would scan 50 percent of pages before applying A. For a 100 million row table, the difference is 1 million rows scanned versus 50 million rows scanned. Query optimizers use statistics to estimate selectivity, but skewed distributions or outdated stats can lead to poor plans. Equality predicates should precede range predicates in composite indexes. An index on (Status, CreatedDate) works well for queries filtering Status equals Active and CreatedDate between two timestamps because the index navigates to the Active section then scans the date range sequentially. Reversing to (CreatedDate, Status) forces a large date range scan followed by filtering on Status, touching many irrelevant rows. The tradeoff is maintenance cost and diminishing returns. Each additional column in a composite index increases index width, reducing fan out and increasing tree height slightly. More importantly, the index only helps queries that match the leftmost prefix pattern. A single index on (A, B, C) does not help queries filtering only B or only C, requiring separate indexes and multiplying write amplification. Production systems balance coverage versus write cost by creating 2 to 4 composite indexes per table, carefully chosen based on query frequency and selectivity measurements from query logs.
💡 Key Takeaways
Leftmost prefix rule: index on (A, B, C) serves queries on A, (A, B), (A, B, C) but not B alone or C alone, requiring column presence in predicate order
Leading with high selectivity column (1 percent filter) scans 1 million rows versus leading with low selectivity column (50 percent filter) scanning 50 million rows on 100 million row table
Equality predicates before range predicates: (Status='Active', Date>X) index navigates to Status section then scans date range versus (Date>X, Status='Active') scanning large date range then filtering
Each additional composite index column increases entry size by 4 to 16 bytes reducing fan out from 400 to 300 and adding one random write per insert/update
Skewed distributions break selectivity assumptions: a Status column with 99 percent Active rows makes Status leading position nearly useless despite appearing selective in schema
📌 Examples
Google Cloud Spanner: Composite index on (ShardKey, Timestamp, UserID) for time series data enables efficient range scans within shard while supporting exact UserID lookups via leftmost prefix
PostgreSQL: Index on (tenant_id, created_at DESC) for multi tenant SaaS application supports tenant isolation and recent item queries; reversing order would require full index scan per tenant
Microsoft SQL Server: Query filtering WHERE Region='West' AND Status='Active' AND Date>'2024-01-01' with index on (Region, Status, Date) scans 1K rows versus index on (Date, Status, Region) scanning 10M rows then filtering
← Back to Indexing Strategies Overview