Partitioning & Sharding • Secondary Indexes with Partitioning
What Are the Failure Modes and Edge Cases of Secondary Indexes in Partitioned Systems?
Secondary indexes in partitioned systems introduce multiple failure modes that can violate application correctness or degrade performance under edge conditions. First, eventual consistency in global indexes causes index lag, where base writes complete but index updates have not yet propagated. During lag, queries return false negatives: a record exists and matches the predicate, but the index does not yet point to it. Under normal load, DynamoDB GSI lag is milliseconds, but during traffic spikes, backfills, or partition hotspots, lag can extend to seconds or minutes. Applications must tolerate missing results and consider retry logic or reconciliation jobs to detect and repair inconsistencies.
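A minimal sketch of a lag-tolerant read follows, assuming a hypothetical query_index callable that returns whatever the global index currently sees; the caller retries with exponential backoff before concluding that a record is missing, and falls back to the base table or a reconciliation path otherwise.

```python
import random
import time

def query_with_lag_tolerance(query_index, status, expected_order_id,
                             max_attempts=4, base_delay=0.2):
    """Query a global secondary index and retry on suspected false negatives.

    query_index is a hypothetical callable returning the records the index
    currently holds for the given status. Because global index updates
    propagate asynchronously, a freshly written record can be missing for
    tens to hundreds of milliseconds (or longer during spikes).
    """
    for attempt in range(max_attempts):
        results = query_index(status)
        if any(r["order_id"] == expected_order_id for r in results):
            return results
        # Exponential backoff with jitter before re-reading the index.
        time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
    # Still missing after retries: fall back to a base-table read keyed by
    # the primary key, or flag the item for a reconciliation job, rather
    # than treating the record as absent.
    return query_index(status)
```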
Second, hot term skew in global indexes concentrates reads and writes on a single index partition, causing throttling and latency spikes even when aggregate capacity is high. For example, if status = active describes 90 percent of records and receives frequent queries, that single index partition becomes a bottleneck. Mitigation requires salting the term into B buckets by appending a bucket number (for example, a hash of the record id mod B) to the indexed term, but this forces queries to fan out to all B buckets, reintroducing some scatter/gather overhead. Write races and reordering are another edge case: without distributed transactions, updates to the base record and the index can arrive out of order at different partitions, causing transient inconsistencies. Versioning (e.g., timestamps or sequence numbers) with last-writer-wins semantics at the index helps, but applications must still handle cases where an index entry points to a base record that no longer matches the indexed predicate.
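The salting trade-off can be made concrete with a small sketch. The bucket is derived from the record id so that entries for a hot term spread across B index partitions, while reads pay the fan-out; the helpers below (query_bucket in particular) are hypothetical stand-ins for whatever index API the system exposes.

```python
import hashlib

B = 10  # number of salt buckets for a hot index term (an illustrative choice)

def salted_index_key(term_value: str, record_id: str) -> str:
    """Write side: build the partition key for a salted index entry.

    The bucket comes from the record id, not the term itself, so entries
    for a hot value like "active" spread across B index partitions.
    """
    bucket = int(hashlib.md5(record_id.encode()).hexdigest(), 16) % B
    return f"{term_value}#{bucket}"

def query_salted_term(query_bucket, term_value: str) -> list:
    """Read side: fan out to all B buckets and merge the results.

    query_bucket is a hypothetical callable that queries one index
    partition key; the scatter/gather is the cost of distributing writes.
    """
    results = []
    for bucket in range(B):
        results.extend(query_bucket(f"{term_value}#{bucket}"))
    return results
```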
Fan-out amplification in local indexes creates severe tail latency problems as partition count grows. With N partitions, the probability of hitting at least one slow shard is approximately 1 - (1 - p)^N, where p is the per-shard probability of a slow response (0.01 at the p99). For 100 partitions where each partition exceeds 100 milliseconds 1 percent of the time, the overall query exceeds 100 milliseconds roughly 63 percent of the time, degrading application p99 SLOs. Systems mitigate with hedged requests (send a duplicate request to a replica shard after a timeout) or top-K early termination, but these add complexity and load. Local indexes also inherit per-partition-key size limits: DynamoDB LSIs cap item collections at 10 GB per partition key, causing writes to fail when the limit is hit, often discovered only in production after a partition key has accumulated too many items. Reindex and backfill operations consume significant write capacity and can take hours on terabyte-scale tables, during which new index queries return incomplete results and latency may spike across the entire system.
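The straggler math is easy to check directly; the snippet below just evaluates 1 - (1 - p)^N for the numbers used above.

```python
def straggler_probability(p: float, n: int) -> float:
    """Probability that at least one of n shards responds slowly, given each
    shard independently exceeds the latency threshold with probability p
    (p = 0.01 corresponds to a per-shard p99)."""
    return 1 - (1 - p) ** n

print(straggler_probability(0.01, 100))  # ~0.63: most fan-out queries hit a straggler
print(straggler_probability(0.01, 10))   # ~0.10: smaller fan-out keeps the tail manageable
```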
💡 Key Takeaways
•Index lag in global indexes causes false negatives during the lag window. DynamoDB GSI lag is typically 100 to 500 milliseconds but can reach seconds during spikes or backfills. Critical flows must handle missing items, retry with exponential backoff, or reconcile via batch jobs.
•Hot term skew on a global index partition creates a throughput bottleneck. If 80 percent of queries hit country = US, that single index partition throttles at its capacity limit even when total provisioned capacity is 10 times higher. Salting into B buckets trades read fan-out for write distribution.
•Fan-out amplification in local indexes compounds tail latency as partition count grows. At 50 partitions with a per-shard p99 of 20 milliseconds, coordinator p99 often exceeds 60 milliseconds; at 100 partitions, p99 can reach 100 to 200 milliseconds due to stragglers.
•Write races cause transient inconsistencies. A base record update and its index update can arrive out of order at their respective partitions. Applications must use versioning and tolerate stale index entries that point to base records no longer matching the predicate; see the sketch after this list.
•The DynamoDB LSI item collection limit of 10 gigabytes per partition key makes LSIs unusable when a single partition key accumulates a large item collection, as happens with low-cardinality keys like status or category. Hitting the limit causes write failures; mitigation requires repartitioning the table or switching to GSIs.
•Backfilling a new global index on a 5 terabyte table with 10,000 write capacity units can take 4 to 8 hours. During backfill, index queries miss older items, and latency may spike. Production rollouts must account for partial results and throttle backfill to avoid impacting online traffic.
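As referenced in the write-race bullet above, here is a minimal sketch of last-writer-wins at the index plus a read-time re-check against the base record. It assumes each update carries a monotonically increasing version, and fetch_base_record is a hypothetical base-table lookup, not any particular database's API.

```python
def apply_index_update(current_entry, incoming):
    """Last-writer-wins: ignore index updates that arrive out of order.

    current_entry is the stored index row (or None); incoming is the update
    derived from a base-table write. Both carry a monotonically increasing
    version (timestamp or sequence number assigned at the base partition).
    """
    if current_entry is not None and incoming["version"] <= current_entry["version"]:
        return current_entry  # stale update delivered late; drop it
    return incoming

def verified_hit(index_row, fetch_base_record):
    """Read-time guard: confirm an index hit still matches the predicate.

    Because the index can briefly point at a record that no longer matches,
    the caller re-reads the base record (via the hypothetical
    fetch_base_record) before trusting the result.
    """
    base = fetch_base_record(index_row["record_id"])
    if base and base.get(index_row["attribute"]) == index_row["value"]:
        return base
    return None
```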
📌 Examples
Index lag scenario: An order service writes order_id = 12345 with status = pending at time T. At T + 200 milliseconds, a query on the status GSI for status = pending does not return order 12345 because the index update has not propagated. The client retries at T + 600 milliseconds and sees the order. Application logic must tolerate this delay for non-critical flows.
Hot term example: A social media platform indexes posts by visibility with 95 percent public, 5 percent private. The public term index partition receives 95 percent of query load and throttles at 3,000 queries per second (QPS) despite table capacity of 30,000 QPS. Engineers salt visibility into 10 buckets, making the index key (public#(hash(post_id) mod 10), post_id). Each logical query now fans out to all 10 buckets, but each bucket holds roughly one tenth of the public entries, so per-partition load drops about 10x, eliminating throttling while increasing query latency from 8 milliseconds to 20 milliseconds.
Fan out tail latency: A 100 shard Elasticsearch cluster with per shard p99 equals 30 milliseconds serves a query that fans to all shards. The coordinator p99 exceeds 80 milliseconds due to the max of N effect. Reducing shard count to 30 and increasing shard size drops coordinator p99 to 40 milliseconds.
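A small Monte Carlo sketch of the max-of-N effect follows, assuming a lognormal per-shard latency distribution with parameters chosen only so the per-shard p99 lands near 30 milliseconds; real clusters should be measured rather than modeled this way.

```python
import random

def coordinator_p99(n_shards: int, trials: int = 20000) -> float:
    """Simulate scatter/gather: the coordinator waits for the slowest shard.

    Per-shard latency (ms) is drawn from a lognormal distribution tuned so
    the per-shard p99 is roughly 30 ms; the coordinator's latency for each
    trial is the maximum across all shards.
    """
    maxima = sorted(
        max(random.lognormvariate(2.0, 0.6) for _ in range(n_shards))
        for _ in range(trials)
    )
    return maxima[int(0.99 * trials)]

print(coordinator_p99(100))  # well above the single-shard p99 of ~30 ms
print(coordinator_p99(30))   # fewer shards, noticeably lower coordinator tail
```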
Backfill impact: Adding a DynamoDB GSI on a 3 terabyte table takes 5 hours. During hour 2, a spike in base write traffic increases GSI lag from 200 milliseconds to 3 seconds, causing a spike in false-negative query results. Dashboards show the index lag metric jumping, triggering alerts and requiring a temporary capacity increase to drain the backlog.