
Read-After-Write Consistency with Position-Based Tokens

Read-after-write consistency guarantees that after a client successfully writes data, any subsequent read by that same client sees the written data or a newer version. This is challenging under asynchronous replication: replicas lag behind the leader, so routing a read to a lagging replica can return stale data that omits the client's own recent write. The naive solution of always reading from the leader sacrifices the scalability benefits of read replicas; the naive alternative of waiting for all replicas to catch up adds unacceptable latency and fails entirely when replication lag is large.

The production-proven solution, exemplified by Box's architecture, uses position-based consistency tokens. After the leader commits a write, it captures its current replication-log position (the log sequence number or byte offset of that commit) and returns this position to the client as an opaque token. The client attaches the token to subsequent read requests. The read router tracks each replica's applied position in real time; when a read arrives with a token, it selects the first available replica whose applied position is greater than or equal to the token's position, meaning that replica has definitely applied the write. If no replica has caught up yet, the system can either route to the leader (guaranteed to have the data, but adds load) or wait with a bounded timeout for a replica to catch up (trading latency for leader offload). Non-critical reads without tokens go directly to any replica for maximum scale.

This approach delivered measurable results at Box: 75% of read traffic was served by replicas without sacrificing read-after-write consistency for critical paths, and with zero added latency during normal replication lag (under 1 to 2 seconds). During severe lag spikes, retries become progressively more likely to succeed because each retry can target a replica that has caught up further.

The key insight is scoping: gate only on the replication positions relevant to the specific object or tenant being accessed, not on global replication state. A user reading their own document needs only a replica caught up on that document's writes, not on unrelated writes to millions of other documents. This dramatically reduces perceived lag and avoids unnecessary waits.
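A minimal sketch of this flow in Python, assuming a hypothetical leader/replica/router API; the names `commit_position`, `applied_position`, and `route` are illustrative, not Box's actual interfaces:

```python
import time
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    applied_position: int  # last replication-log position this replica has applied

class Leader:
    def __init__(self) -> None:
        self.commit_position = 0

    def write(self, data) -> int:
        """Commit a write and return the log position as a consistency token."""
        self.commit_position += 1  # stands in for the LSN / byte offset of the commit
        # ... durably append `data` to the replication log at this position ...
        return self.commit_position

class ReadRouter:
    def __init__(self, leader: Leader, replicas: list[Replica]) -> None:
        self.leader = leader
        self.replicas = replicas

    def route(self, token: int | None = None, timeout_s: float = 0.3):
        # Reads without a token go to any replica: maximum scale,
        # eventual consistency is acceptable.
        if token is None:
            return self.replicas[0]
        # Critical reads: pick any replica whose applied position has
        # reached the token, i.e. one that has definitely applied the write.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            for r in self.replicas:
                if r.applied_position >= token:
                    return r
            time.sleep(0.01)  # bounded wait for a replica to catch up
        # Fallback: the leader is guaranteed to have the data,
        # at the cost of extra leader load.
        return self.leader
```

In use, a client would call `leader.write(...)`, keep the returned position as its token, and pass it to `router.route(token=...)` on its next read.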
💡 Key Takeaways
Position-based tokens eliminate the need to always read from the leader after writes, allowing 75% or more of reads to be served by replicas while maintaining read-after-write consistency
The token contains the leader's log position at write-commit time (sequence number or byte offset), not a timestamp, making it immune to clock skew between servers
Read routers track each replica's applied position in real time and select any replica whose applied position is greater than or equal to the token position, routing to the fastest available replica that meets the requirement
Scoping tokens to specific objects or tenants is critical: a user's read needs only a replica caught up on that user's writes, not all writes in the system, dramatically reducing wait times (see the sketch after this list)
Non-critical reads that can accept eventual consistency omit the token and go to any replica for maximum throughput, while critical reads include tokens for strong consistency guarantees
Fallback strategies include routing to the leader if no replica has caught up within a timeout (typically 100 to 500 milliseconds), trading added leader load for the consistency guarantee
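To illustrate the scoping takeaway, here is a hedged sketch of a router that tracks applied positions per scope (an object or tenant ID) instead of globally; the `record_apply`/`pick_replica` shape is an assumption, not a documented API:

```python
from collections import defaultdict

class ScopedReadRouter:
    def __init__(self, replicas: list[str]) -> None:
        self.replicas = replicas
        # applied[replica][scope] = highest position applied for that scope
        self.applied = defaultdict(lambda: defaultdict(int))

    def record_apply(self, replica: str, scope: str, position: int) -> None:
        """Called as each replica reports replication progress per scope."""
        self.applied[replica][scope] = max(self.applied[replica][scope], position)

    def pick_replica(self, scope: str, token: int) -> str | None:
        # Gate only on the scope this read touches: a replica lagging on
        # unrelated tenants is still eligible to serve it.
        for r in self.replicas:
            if self.applied[r][scope] >= token:
                return r
        return None  # caller falls back to the leader or a bounded wait
```

A replica that is seconds behind on other tenants can still serve this read immediately, which is what keeps perceived lag low.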
📌 Examples
Box routes significant writes (user uploads, permission changes) with position tokens; 75% of subsequent reads hit replicas with zero added latency under normal 1-second lag, substantially offloading the leader
Meta's TAO achieves session-level read-your-writes by routing a user's requests to the primary region, or by waiting for cache invalidations and replication to catch up after a write, using similar position tracking internally
An e-commerce checkout flow captures the order's commit position and includes it in the "view order" redirect, ensuring the user immediately sees their order even if replicas are 2 to 3 seconds behind (sketched below)
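A hedged sketch of that checkout flow, building on the `Leader`/`ReadRouter` sketch above; the handler names and the `pos` query parameter are hypothetical:

```python
from urllib.parse import urlencode

def place_order(leader, router, order: dict) -> str:
    token = leader.write(order)  # commit returns the log position
    # Embed the position token in the redirect URL so the very next
    # read can be gated on it.
    return f"/orders/{order['id']}?{urlencode({'pos': token})}"

def view_order(router, order_id: str, pos: str | None):
    token = int(pos) if pos is not None else None
    # Returns a replica that has applied the order's commit,
    # or the leader as a fallback.
    return router.route(token=token)
```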