Bloom Filters

What Are Bloom Filters and How Do They Work?

A Bloom filter is a probabilistic data structure that answers membership queries with extreme space efficiency by trading perfect accuracy for speed and memory. The structure consists of a fixed-size bit array of m bits (all initialized to 0) and k independent hash functions. When inserting an element, you apply all k hash functions to compute k array positions and set those bits to 1. To check whether an element exists, you hash it with the same k functions and check whether all k corresponding bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element might be present, but you cannot be certain due to hash collisions from other elements.

The key insight is that Bloom filters guarantee no false negatives but allow false positives. This asymmetry makes them perfect for scenarios where checking a negative is expensive (like a disk seek or network call) but occasionally treating an absent item as present is acceptable. Standard Bloom filters do not support deletions because clearing a bit might affect other elements that also hashed to that position. With proper sizing, you can achieve approximately 1% false positive probability at roughly 9 to 10 bits per element, regardless of how large your actual keys are. A 100 million key set at a 1% false positive rate requires only about 120 MB, far smaller than storing the actual keys or using a hash set.

The mathematical relationships governing Bloom filters are well established. With n inserted elements, the false positive rate is approximately P ≈ (1 − (1 − 1/m)^(kn))^k. For a target false positive rate P and expected element count n, the optimal bit-array size is m ≈ −n·ln(P) / (ln 2)², and the optimal number of hash functions is k ≈ (m/n)·ln 2. These formulas translate to practical rules: at a 1% false positive rate, you need about 9.6 bits per element and approximately 7 hash functions.
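To make the mechanics concrete, here is a minimal Python sketch (the class name, the double-hashing trick, and the use of SHA-256 are illustrative choices, not a reference implementation). It sizes the bit array and hash count from the formulas above and implements the insert/lookup procedure just described.

```python
import math
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions."""

    def __init__(self, expected_items: int, fp_rate: float):
        # Optimal sizing from the formulas above:
        #   m = -n * ln(P) / (ln 2)^2,   k = (m / n) * ln 2
        self.m = math.ceil(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Double hashing (h1 + i*h2) derives k probe positions from two base
        # hashes; an illustrative shortcut for "k independent hash functions".
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd/nonzero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False => definitely absent; True => possibly present (false positive
        # with probability roughly fp_rate).
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

# Example usage: 1,000 expected items at a 1% target false positive rate.
bf = BloomFilter(expected_items=1_000, fp_rate=0.01)
bf.add("user:42")
print(bf.might_contain("user:42"))   # True (no false negatives)
print(bf.might_contain("user:999"))  # Usually False; True about 1% of the time
```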
💡 Key Takeaways
Bloom filters use a bit array of m bits and k hash functions to represent set membership, with O(k) time for both insertion and lookup, independent of the number of stored elements n.
The structure guarantees zero false negatives (if it says an element is absent, it definitely is) but allows tunable false positives (if it says present, there is a probability P it is wrong due to bit collisions).
Space efficiency is exceptional: approximately 9.6 bits per element achieves 1% false positive rate regardless of key size. A 100 million element set requires only 120 MB compared to gigabytes for hash sets storing actual keys.
Optimal sizing follows m ≈ −n·ln(P) / (ln 2)² for the bit-array size and k ≈ (m/n)·ln 2 for the hash-function count. At a 1% false positive rate, k is approximately 7; a worked calculation follows this list.
Standard Bloom filters do not support deletions because clearing a bit may invalidate other elements. Variants like Counting Bloom filters allow deletions but use 4 to 8 times more memory by replacing bits with small counters.
Production deployment requires monitoring the load factor (actual n versus design n), because inserting more elements than planned causes the false positive rate to increase superlinearly, potentially degrading from 1% to 20% or higher if capacity is exceeded by 5x.
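As a quick check of the sizing formulas, the short calculation below (plain Python, values rounded) reproduces the numbers quoted above for 100 million keys at a 1% target false positive rate.

```python
import math

n = 100_000_000   # expected elements
p = 0.01          # target false positive rate

m = -n * math.log(p) / (math.log(2) ** 2)   # optimal number of bits
k = (m / n) * math.log(2)                   # optimal number of hash functions

print(m / n)          # ~9.59 bits per element
print(k)              # ~6.64 -> round to 7 hash functions
print(m / 8 / 1e6)    # ~119.8 MB total, i.e. the ~120 MB cited above
```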
📌 Examples
Google Bigtable stores a Bloom filter with each immutable Sorted String Table (SSTable) file. When looking up a key, the system checks the Bloom filter first. With a 1% false positive rate, 99% of lookups for keys not in that SSTable avoid a disk seek. On spinning disks, where a random seek costs 5 to 10 milliseconds, this is a substantial latency saving. For 100 million keys, the Bloom filter occupies about 120 MB of memory per SSTable.
Chrome Safe Browsing uses a client side Bloom filter containing approximately 10 million hash prefixes of malicious URLs at 1% false positive rate, requiring about 15 MB of memory. This enables sub-millisecond local checks and reduces network lookups for safe URLs by 99%. Only the 1% false positives and actual malicious URLs trigger server verification.
RocksDB and LevelDB use Bloom filters with a common default of 10 bits per key (approximately 0.8% false positive rate). In random read workloads with many negative lookups, this avoids file and block reads, improving read throughput by 1.5 to 2 times compared to having no filter, primarily by eliminating unnecessary disk I/O operations.
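All three deployments share the same access pattern: consult the in-memory filter first and only pay for the expensive read on a "maybe". The sketch below illustrates that guard in Python; the per-file (filter, reader) layout and the reader interface are hypothetical stand-ins, not the actual Bigtable or RocksDB API.

```python
def get(key, sstables):
    """Hypothetical point lookup across immutable SSTable files.

    Each entry in `sstables` is assumed to be a (bloom_filter, reader) pair,
    where bloom_filter was built from exactly the keys stored in that file
    and reader.get(key) performs the expensive disk read (None on a miss).
    """
    for bloom_filter, reader in sstables:
        # "Definitely absent" answers skip the disk seek entirely.
        if not bloom_filter.might_contain(key):
            continue
        # "Maybe present": pay for the read; at a 1% false positive rate,
        # roughly 1 in 100 of these reads finds nothing.
        value = reader.get(key)
        if value is not None:
            return value
    return None
```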