Document Database Failure Modes and Edge Cases
Even well-designed document database systems encounter failure modes under production load. Understanding these edge cases prevents outages and SLA violations. Hot partitions, per-document write limits, index fan-out, and unbounded document growth are the most common culprits. Each has specific symptoms and mitigation patterns proven at scale.
Hot partitions occur when a shard key does not account for skew. A viral product or celebrity user can concentrate traffic on one shard, even with a hashed key, if the key's value distribution is uneven. Symptoms include p99 latency jumping from 10ms to 500ms for queries hitting that shard while other shards remain healthy. Firestore enforces a per-document write limit of approximately 1 sustained write per second. High-frequency counters (page views, likes) hitting one document will throttle. The standard solution is sharded counters: split the counter across 100 to 1,000 documents, each accepting increments independently, then sum on read. This trades read latency (aggregating N shards) for write throughput (N times higher capacity).
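A minimal sketch of the sharded-counter pattern using the Firebase Admin SDK in TypeScript; the `counters/{counterId}/shards/{shardId}` layout, the field names, and the shard count are illustrative assumptions, not Firestore requirements:

```ts
import { initializeApp } from "firebase-admin/app";
import { FieldValue, getFirestore } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();
const NUM_SHARDS = 100; // write capacity scales roughly linearly with shard count

// Increment: pick a random shard so writes spread across documents,
// keeping each individual shard under the ~1 write/sec sustained limit.
async function incrementCounter(counterId: string): Promise<void> {
  const shardId = Math.floor(Math.random() * NUM_SHARDS);
  const shardRef = db.doc(`counters/${counterId}/shards/${shardId}`);
  await shardRef.set({ count: FieldValue.increment(1) }, { merge: true });
}

// Read: sum all shards. This is the read-side cost the pattern trades for write
// throughput; cache or periodically materialize the total if reads are hot too.
async function readCounter(counterId: string): Promise<number> {
  const snapshot = await db.collection(`counters/${counterId}/shards`).get();
  return snapshot.docs.reduce((sum, doc) => sum + (doc.data().count ?? 0), 0);
}
```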
Index fan-out from large arrays amplifies write cost. A document with 1,000 tags indexed as multi-key generates 1,000 index entries per write. Updating that document takes 500ms instead of 5ms and consumes proportional storage. If many documents have large indexed arrays, write throughput collapses. Keep indexed arrays bounded (under 100 elements), move unbounded lists to separate child collections, or avoid indexing the array altogether and filter in application code.
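One way to keep fan-out bounded, sketched with the MongoDB Node.js driver; the collection and field names (`products`, `productTags`, `coreTags`) are hypothetical. The parent keeps a small indexed array while the unbounded tag list moves to a child collection with one small document per tag:

```ts
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");

async function setupIndexes(): Promise<void> {
  await client.connect();
  const db = client.db("catalog");
  // Multikey index only over the small, bounded coreTags array (~50 entries max).
  await db.collection("products").createIndex({ coreTags: 1 });
  // The full tag list lives in a child collection keyed by (productId, tag).
  await db.collection("productTags").createIndex({ productId: 1, tag: 1 }, { unique: true });
}

async function addTag(productId: string, tag: string): Promise<void> {
  // Adding a tag touches one tiny document and one index entry,
  // instead of rewriting a 1,000-entry multikey index on the product.
  await client.db("catalog").collection("productTags").updateOne(
    { productId, tag },
    { $setOnInsert: { productId, tag } },
    { upsert: true }
  );
}
```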
Unbounded document growth hits hard size limits. MongoDB caps documents at 16 MiB, Firestore at approximately 1 MiB. Embedding unbounded arrays (comments, events, logs) eventually hits the limit, causing write failures. Symptom: writes succeed for months, then suddenly fail with document-size errors as arrays grow. Solution: split into a parent document with a pre-aggregated summary (comment count, top commenter) and a child collection for individual items. This pattern is used at Meta for post comments: the post document stores commentCount and topComments (an array of 3), while the full comment list lives in a separate comments subcollection with pagination.
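A sketch of the parent-summary plus child-subcollection pattern with the Firebase Admin SDK; the `posts` and `comments` collection names and the field names are illustrative assumptions:

```ts
import { FieldValue, getFirestore } from "firebase-admin/firestore";

const db = getFirestore();

async function addComment(postId: string, author: string, text: string): Promise<void> {
  const postRef = db.doc(`posts/${postId}`);
  const commentRef = postRef.collection("comments").doc(); // auto-generated ID

  // Batched write: create the comment document and bump the denormalized
  // summary on the parent atomically, so the count never drifts on failure.
  const batch = db.batch();
  batch.set(commentRef, { author, text, createdAt: FieldValue.serverTimestamp() });
  batch.set(postRef, { commentCount: FieldValue.increment(1) }, { merge: true });
  await batch.commit();
}

// Reads page through the subcollection instead of loading one ever-growing array.
async function listComments(postId: string, pageSize = 20) {
  return db
    .collection(`posts/${postId}/comments`)
    .orderBy("createdAt", "desc")
    .limit(pageSize)
    .get();
}
```

Note that commentCount still lives on a single document, so a post hot enough to exceed the per-document write limit would need the sharded-counter treatment above for its summary fields as well.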
Replication lag and rollbacks in non-majority configurations can cause data loss. A write acknowledged by the primary but not yet replicated to secondaries is lost if the primary crashes and a secondary is elected. Use majority write concern for critical data. Monitor replication lag: if secondaries fall 10+ seconds behind, consider shedding read traffic from lagging replicas to avoid serving very stale data.
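A minimal sketch of requesting majority acknowledgment with the MongoDB Node.js driver; the connection string, database, and collection names are hypothetical:

```ts
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");

async function placeOrder(order: { userId: string; total: number }): Promise<void> {
  // w: "majority" blocks until a majority of replica-set members have the write,
  // so a primary crash followed by failover cannot roll it back.
  await client
    .db("shop")
    .collection("orders")
    .insertOne(order, { writeConcern: { w: "majority" } });
}
```

The extra replication round trip adds latency, which is why the takeaways below reserve majority acknowledgment for critical writes; less critical writes can keep the default concern.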
💡 Key Takeaways
• Hot partitions from skewed distribution: celebrity user or viral product concentrates traffic on one shard, p99 latency spikes from 10ms to 500ms while other shards idle, requires shard key redesign or secondary distribution key
• Firestore per-document write limit approximately 1 write/sec sustained: high-frequency counters throttle, sharded counter pattern splits across N documents for N times write capacity, trades read aggregation cost
• Index fan-out from large arrays: document with 1,000 indexed tags generates 1,000 index entries, single update takes 500ms instead of 5ms, multiply by write rate for total impact, bound arrays to under 100 elements
• Document size limits are strict: MongoDB 16 MiB, Firestore ~1 MiB, embedding unbounded arrays (comments, logs) eventually hits limit causing write failures, split into parent summary plus child collection pattern
• Non-majority write concern risks rollbacks: write acknowledged by primary but not replicated is lost on primary crash, use majority write concern for financial, booking, or critical user data despite latency cost
• Replication lag monitoring prevents stale reads: secondary 10+ seconds behind serves outdated data, users see inconsistent state, shed read traffic from lagging replicas or alert on lag exceeding threshold
📌 Examples
Hot partition: viral tweet with 1M likes, all like increments hit one document, Firestore throttles writes to 1/sec, users see "like failed" errors, solution is sharded counter across 1,000 documents for 1,000 writes/sec capacity
Index fan-out: product document with 5,000 search keywords indexed as array, single product update writes 5,000 index entries taking 800ms, reduce to 50 core keywords and move rest to separate search index
Unbounded growth: blog post document embeds comments array, after 2 years hits 1 MiB Firestore limit, writes fail, refactor to post document with commentCount field and /posts/{id}/comments subcollection for pagination
Rollback scenario: MongoDB w:1 write acknowledged, primary crashes before replication, secondary elected as new primary without the write, data lost, user sees order placed then disappears, majority write prevents this