
Graph Database Failure Modes and Operational Challenges

Graph database failures often emerge from the interaction between graph structure and distributed-systems constraints. Unbounded or cyclic traversals are a primary failure mode: variable-length pattern matches on loopy graphs can explode combinatorially, visiting millions of nodes when you expected hundreds. A query for all paths between two nodes in a densely connected social graph, run without a depth limit, can execute for hours, exhausting memory and blocking other queries. Mitigations include strict depth limits (e.g., max 4 hops), cycle detection with path-uniqueness constraints, and query budgets that kill a traversal after it visits a threshold number of nodes or consumes a time budget.

Cross-shard traversals degrade predictably under load. A 2-hop query that crosses 3 shards requires sequential round trips: fetch from Shard 1 (20ms), network to Shard 2 (10ms), fetch (20ms), network to Shard 3 (10ms), fetch (20ms), merge results (10ms), totaling 90ms at minimum. Under load, queuing delays at each hop compound, pushing p99 from 90ms to 500ms or timing out entirely. If one shard is slow or unavailable, the entire query blocks or fails. This is why companies partition by community or tenant boundaries to minimize cut edges, and why Meta's TAO restricts queries to 1 to 2 hops within a region, backed by aggressive caching.

Consistency anomalies in replicated graphs cause subtle bugs. With asynchronous replication, a user updates their privacy settings (removing an edge) on the primary, but read replicas lag by 10 seconds. Downstream services reading from replicas see stale neighbor lists and grant access incorrectly, creating authorization vulnerabilities.

Write amplification on popular subgraphs exacerbates this: a trending topic receiving 10,000 new edges per second with 3 replicas generates 30,000 writes per second concentrated on one partition, overwhelming that shard and increasing replication lag further. Mitigations include read-your-writes guarantees for sensitive operations, monitoring replication lag via log sequence numbers, bounded staleness windows, and rate limiting or backpressure on high-churn subgraphs.
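A minimal sketch of the traversal safeguards described above, assuming a plain in-memory adjacency dict rather than a real Neo4j driver; the function name and parameters are illustrative, and the visited set gives node-level cycle detection (path enumeration would track a per-path visited set instead):

```python
import time
from collections import deque

def bounded_traversal(adjacency, start, max_depth=4,
                      node_budget=100_000, time_budget_s=2.0):
    """BFS with a depth limit, cycle detection via a visited set,
    and node/time budgets that abort runaway traversals.
    `adjacency` maps node -> iterable of neighbor nodes."""
    visited = {start}                      # cycle detection: never revisit a node
    frontier = deque([(start, 0)])
    deadline = time.monotonic() + time_budget_s
    reached = []
    while frontier:
        # Query budget: kill the traversal once it has visited too many
        # nodes or run too long, instead of letting it explode.
        if len(visited) > node_budget or time.monotonic() > deadline:
            raise RuntimeError("traversal budget exceeded; query killed")
        node, depth = frontier.popleft()
        reached.append(node)
        if depth == max_depth:             # strict depth limit (max 4 hops)
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached
```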
💡 Key Takeaways
Unbounded traversals on loopy graphs cause combinatorial explosion, visiting millions of nodes when hundreds were expected. Mitigate with strict depth limits (max 4 hops), cycle detection, and query budgets that kill a traversal after it visits a threshold number of nodes
Cross-shard 2-hop queries require sequential round trips totaling at least 90ms (3 fetches at 20ms + 2 network hops at 10ms + 10ms merge); under load, queuing delays compound, pushing p99 to 500ms or to timeout
Asynchronous replication lag (10 seconds in the example above) causes consistency anomalies: replicas serve stale neighbor lists, creating authorization vulnerabilities when downstream services read from lagged replicas (see the read-your-writes sketch after this list)
Write amplification on popular subgraphs: a trending topic with 10,000 new edges per second and 3 replicas generates 30,000 writes per second on one partition, overwhelming that shard and increasing lag further
Storage and memory pressure on relationship-heavy schemas: pointer-dense adjacency that does not fit in memory causes cache misses, increasing hop latencies from milliseconds to hundreds of milliseconds per lookup
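A sketch of the read-your-writes mitigation named above, under the assumption that the primary and replica clients expose log sequence numbers; `last_write_lsn`, `applied_lsn`, and `get` are hypothetical names for illustration, not a specific driver API:

```python
import time

def read_own_write(primary, replicas, key, timeout_s=1.0):
    """Read-your-writes via log sequence numbers: remember the LSN of
    the session's last write, then only read from a replica that has
    applied at least that LSN; fall back to the primary if no replica
    catches up within the bounded-staleness window."""
    target_lsn = primary.last_write_lsn()   # LSN of this session's last write
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for replica in replicas:
            if replica.applied_lsn() >= target_lsn:
                return replica.get(key)     # replica is fresh enough to serve
        time.sleep(0.01)                    # brief backoff while lag drains
    return primary.get(key)                 # staleness window exceeded: read primary
```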
📌 Examples
Variable-length pattern match without a depth limit: a query for all paths between two users in a 100 million node social graph runs for hours, exhausting 64GB of memory and blocking other queries until killed by a timeout
Community-sharded graph with 10% cut edges: under normal load p99 is 50ms, but during a traffic spike one slow shard pushes cross-shard queries to 500ms p99 as queuing delays compound across hops (modeled in the sketch after this list)
Privacy-setting update on the primary with 10-second replica lag: a user removes a friend edge, but read replicas serve the old neighbor list for 10 seconds, so a downstream authorization service grants access incorrectly, creating a security vulnerability
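The latency figures in these examples follow from a simple additive model; a toy calculation where the per-step queuing delay is an assumed free parameter, not a measured value:

```python
def cross_shard_latency_ms(shards=3, fetch_ms=20, network_ms=10,
                           merge_ms=10, queuing_ms=0):
    """Additive latency model for a sequential multi-shard traversal:
    one fetch per shard, one network hop between consecutive shards,
    a final merge, and an assumed queuing delay added at every step."""
    fetches = shards * (fetch_ms + queuing_ms)
    hops = (shards - 1) * (network_ms + queuing_ms)
    return fetches + hops + merge_ms + queuing_ms  # merge also queues under load

print(cross_shard_latency_ms())               # 90 ms: the unloaded baseline
print(cross_shard_latency_ms(queuing_ms=68))  # ~500 ms once queues build at each step
```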