Database DesignDocument Databases (MongoDB, Firestore)Hard⏱️ ~3 min

Document Database Failure Modes and Edge Cases

Common Failure Modes

Even well-designed document database systems encounter failure modes under production load. Understanding these edge cases prevents outages. Hot partitions, per-document write limits, index fan-out, and unbounded growth are the most common culprits.

Hot Partitions

A shard key that does not account for skew creates hot partitions. A viral item or popular user concentrates traffic on one shard. Symptoms: P99 latency jumps from 10ms to 500ms for queries hitting that shard while others remain healthy. Solution: redesign shard key with additional distribution component or use sharded counters.

Per-Document Write Limits

Some systems enforce per-document write limits (approximately 1 write/second sustained). High-frequency counters (page views, likes) hitting one document throttle. Sharded counters split across 100-1,000 documents, each accepting increments, summed on read. This trades read latency (aggregate N shards) for write throughput (N times capacity).

Index Fan-Out

Large indexed arrays multiply write cost. A document with 1,000 indexed tags generates 1,000 index entries per write. Updating takes 500ms instead of 5ms. Keep indexed arrays bounded under 100 elements or move to separate collections.

Unbounded Document Growth

Document databases have hard size limits (16 MiB for some, 1 MiB for others). Embedding unbounded arrays (comments, logs) eventually hits the limit, causing write failures. Symptom: writes succeed for months then suddenly fail. Solution: split into parent document with summary and child collection for items.

Rollback Risk

Non-majority write concern risks rollbacks. A write acknowledged by the primary but not yet replicated is lost if the primary crashes before replication completes. The user sees data that then disappears. Use majority write concern for critical data despite the latency cost.

💡 Key Takeaways
Hot partitions from skewed access: P99 spikes from 10ms to 500ms while other shards idle
Per-document write limits (1/sec) require sharded counters for high-frequency updates
Index fan-out: 1,000 indexed array elements means 500ms writes instead of 5ms
Document size limits (1-16 MiB) cause sudden failures when unbounded arrays grow
Non-majority writes risk rollback: data appears then disappears after primary crash
📌 Interview Tips
1Identify unbounded growth risks in schema: embedded comments, logs, or event arrays
2Propose sharded counters when high-frequency updates to single document are needed
3Discuss rollback scenarios when choosing write concern for critical vs non-critical data
← Back to Document Databases (MongoDB, Firestore) Overview
Document Database Failure Modes and Edge Cases | Document Databases (MongoDB, Firestore) - System Overflow