Design FundamentalsURL Shortener DesignMedium⏱️ ~3 min

ID Generation Strategies: Trade offs Between Coordination, Predictability, and Collisions

Generating unique short URL tokens at scale requires choosing among several strategies, each with distinct trade offs around coordination overhead, predictability (enumeration risk), and collision probability. Counter based systems with Base62 encoding offer zero collisions and excellent write throughput by allocating disjoint numeric ranges to each application node or region. For example, a coordinator might assign each node a range of 1 million IDs; the node increments locally and Base62 encodes the result, eliminating per write coordination. This approach can easily support thousands of writes per second, but sequential keys are predictable and enable enumeration attacks unless combined with salting, mixing, or reserved token filtering. Random Base62 generation eliminates predictability by generating tokens randomly (e.g., 7 random characters from the 62 character set of A to Z, a to z, and 0 to 9). With 62^7 ≈ 3.5 trillion possibilities, collision probability is extremely low but non zero. The system must check for token existence in the database and retry on collision, adding a round trip. At 1,000 URLs per second, the first collision is expected after approximately 110 years of constant operation, but as the keyspace fills, collision rates increase. Hash based schemes (using MD5, SHA256, or similar) compute a deterministic hash of the canonical long URL and take a prefix as the token. This enables automatic deduplication: the same long URL always yields the same short token within a scope. However, truncating hashes increases collision risk, requiring collision resolution logic such as salting or trying alternate prefixes. A fourth approach, the Key Generation Service (KGS), pre generates a large pool of unique tokens offline and stores them in a database. Write requests simply pop a token from the pool, making writes extremely fast with no collision checks. The KGS refills the pool asynchronously, but this adds operational complexity: pool depletion can halt new URL creation, and ensuring global uniqueness across multiple KGS instances requires careful coordination. Multi region deployments often use time plus shard plus counter composition (similar to Twitter's Snowflake IDs), encoding timestamp bits, shard or region ID, and a per shard counter into the token. This avoids global coordination and guarantees uniqueness, but requires disciplined clock synchronization to prevent out of order or duplicate IDs from clock skew.
💡 Key Takeaways
Counter with range allocation: zero collisions, thousands of writes per second with no per write coordination; predictable keys risk enumeration unless salted or mixed
Random Base62 with 7 characters: first collision expected after approximately 110 years at 1,000 URLs per second; requires database existence check and retry on collision
Hash based (MD5 or SHA prefix): deterministic deduplication for identical long URLs; truncating hashes increases collision probability and requires resolution logic
Key Generation Service: extremely fast writes by popping pre generated tokens; operational complexity in pool management, refill latency, and ensuring global uniqueness
Snowflake style (time plus shard plus counter): no global coordination, uniqueness guaranteed; sensitive to clock skew which can cause out of order or duplicate IDs
At 100 million URLs per month, a 7 character Base62 keyspace lasts approximately 2,900 years; for stronger unguessability or faster growth, use 8 or more characters
📌 Examples
Instagram uses a variant of Snowflake IDs with 41 bits for timestamp milliseconds, 13 bits for shard ID, and 10 bits for auto incrementing sequence, Base62 encoded to generate short tokens without coordination
Bitly employs random Base62 generation with collision detection: generate a 7 character token, check existence in the database, and retry on collision (which occurs extremely rarely in practice)
A company using counter based allocation might assign Node 1 the range 1,000,000 to 1,999,999 and Node 2 the range 2,000,000 to 2,999,999, each incrementing locally and Base62 encoding to yield tokens like 4c92 or 8nP7
← Back to URL Shortener Design Overview