Document Database Failure Modes and Edge Cases
Common Failure Modes
Even well-designed document database systems encounter failure modes under production load. Understanding these edge cases prevents outages. Hot partitions, per-document write limits, index fan-out, and unbounded growth are the most common culprits.
Hot Partitions
A shard key that does not account for skew creates hot partitions. A viral item or popular user concentrates traffic on one shard. Symptoms: P99 latency jumps from 10ms to 500ms for queries hitting that shard while others remain healthy. Solution: redesign shard key with additional distribution component or use sharded counters.
Per-Document Write Limits
Some systems enforce per-document write limits (approximately 1 write/second sustained). High-frequency counters (page views, likes) hitting one document throttle. Sharded counters split across 100-1,000 documents, each accepting increments, summed on read. This trades read latency (aggregate N shards) for write throughput (N times capacity).
Index Fan-Out
Large indexed arrays multiply write cost. A document with 1,000 indexed tags generates 1,000 index entries per write. Updating takes 500ms instead of 5ms. Keep indexed arrays bounded under 100 elements or move to separate collections.
Unbounded Document Growth
Document databases have hard size limits (16 MiB for some, 1 MiB for others). Embedding unbounded arrays (comments, logs) eventually hits the limit, causing write failures. Symptom: writes succeed for months then suddenly fail. Solution: split into parent document with summary and child collection for items.
Rollback Risk
Non-majority write concern risks rollbacks. A write acknowledged by the primary but not yet replicated is lost if the primary crashes before replication completes. The user sees data that then disappears. Use majority write concern for critical data despite the latency cost.