Graph Database Failure Modes and Operational Challenges
Unbounded Traversal Failures
Unbounded or cyclic traversals are a primary failure mode. Variable-length pattern matches on loopy graphs can explode combinatorially, visiting millions of nodes when you expected hundreds. A query for "all paths between two nodes" in a densely connected graph without depth limits can run for hours, exhausting memory and blocking other queries.
Mitigations: strict depth limits (max 4 hops), cycle detection with path uniqueness constraints, and query budgets that kill traversals after visiting 10,000 nodes or consuming 5 seconds.
Cross-Shard Degradation
Cross-shard traversals degrade predictably under load. A 2-hop query crossing 3 shards requires sequential round trips: fetch Shard 1 (20ms), network hop (10ms), fetch Shard 2 (20ms), merge (10ms) = 60ms minimum. Under load, queuing delays compound, pushing p99 (99th percentile latency: the slowest 1% of requests) from 60ms to 500ms+.
Consistency Anomalies
Asynchronous (non-blocking) replication causes subtle bugs. User updates privacy settings (removing an edge) on primary, but replicas lag by 10 seconds. Downstream services reading from replicas see stale neighbor lists, potentially granting incorrect access.
Mitigations: read-your-writes guarantee (after writing, subsequent reads from the same session see the written data, even if it means routing to primary) for sensitive operations. Also: bounded staleness windows and rate limiting on high-churn subgraphs.