
What Are Bloom Filters and How Do They Work?

Definition
A Bloom filter is a probabilistic data structure that answers set membership queries with extreme space efficiency by trading perfect accuracy for speed and memory. It guarantees no false negatives but allows a tunable rate of false positives.

Structure and Operations

The structure consists of a fixed-size bit array of m bits (all initialized to 0) and k independent hash functions. When inserting an element, you apply all k hash functions to compute k array positions and set those bits to 1. To check whether an element exists, you hash it with the same k functions and check whether all k corresponding bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element might be present, but you cannot be certain, because other elements may have set those same bits through hash collisions. This asymmetry makes Bloom filters ideal for scenarios where checking a negative is expensive (like a disk seek or network call) but occasionally treating an absent item as present is acceptable.
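The insert and lookup operations described above can be sketched in a few lines. This is an illustrative minimal implementation, not production code; it derives the k positions from two halves of a SHA-256 digest via double hashing, which is one common way to simulate k independent hash functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m-bit array, k hash positions per element."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # m bits, all initialized to 0

    def _positions(self, item: str):
        # Derive k indices from two base hashes (double-hashing scheme).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set all k bits for this element.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False => definitely absent; True => possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note that `might_contain` can only return `False` with certainty; a `True` result is probabilistic, which is exactly the no-false-negatives asymmetry described above.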

Space Efficiency

Standard Bloom filters do not support deletions, because clearing a bit might affect other elements that also hashed to that position. With proper sizing, you can achieve approximately 1% false positive probability at roughly 9 to 10 bits per element, regardless of how large your actual keys are. A set of 100 million keys at a 1% false positive rate requires only about 120 MB, far smaller than storing the actual keys or using a hash set, which might need 8 to 16 GB for the same key count.
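The 120 MB figure follows directly from the bits-per-element claim. A quick worked calculation (using the section's own numbers):

```python
# Worked arithmetic for the sizing claim above.
n = 100_000_000          # keys in the set
bits_per_element = 9.6   # ~1% false positive rate (derived in the next section)
total_bits = n * bits_per_element
total_mb = total_bits / 8 / 1_000_000  # bits -> bytes -> MB
print(round(total_mb))   # 120
```

The key point is that this cost depends only on n and the target false positive rate, never on the size of the keys themselves.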

Mathematical Foundations

The false positive rate P is approximately (1 - (1 - 1/m)^(k*n))^k, where n is the number of inserted elements. For a target false positive rate P and expected element count n, the optimal bit array size is m = -(n * ln(P)) / (ln(2))^2, and the optimal number of hash functions is k = (m/n) * ln(2). At a 1% false positive rate, you need about 9.6 bits per element and approximately 7 hash functions. Production systems must monitor actual n against design n: exceeding design capacity degrades the false positive rate sharply, and at 5x overcapacity a filter designed for 1% can exceed 20%.
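The two sizing formulas translate directly into code. A small sketch that computes m and k for a given workload (the n = 1,000,000 example value is arbitrary):

```python
import math

def bloom_parameters(n: int, p: float):
    """Optimal bit-array size m and hash count k for n elements at FP rate p."""
    m = -(n * math.log(p)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return math.ceil(m), round(k)

m, k = bloom_parameters(1_000_000, 0.01)
print(m / 1_000_000, k)  # ~9.59 bits per element, k = 7
```

This reproduces the figures quoted above: roughly 9.6 bits per element and 7 hash functions for a 1% false positive target, independent of key size.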

💡 Key Takeaways
Bloom filters use a bit array of m bits and k hash functions for set membership in O(k) time for both insertion and lookup, independent of stored elements n.
Zero false negatives are guaranteed (if the filter says absent, the element definitely is absent), but false positives are tunable (if it says present, there is a probability P that it is wrong due to bit collisions).
Space efficiency: approximately 9.6 bits per element achieves 1% false positive rate regardless of key size. 100 million elements requires only 120 MB versus gigabytes for hash sets.
Optimal sizing: m = -(n * ln(P)) / (ln(2))^2 for bit array, k = (m/n) * ln(2) for hash functions. At 1% false positive, k is approximately 7.
Standard Bloom filters do not support deletions. Counting Bloom filters allow deletions but use 4 to 8x more memory by replacing bits with counters.
📌 Interview Tips
1. LSM tree storage engines store Bloom filters per SSTable file. At a 1% false positive rate, 99% of lookups for absent keys avoid disk seeks. For 100 million keys, the filter occupies about 120 MB of memory per file.
2. Client-side URL blocklists use Bloom filters with 10 million entries at a 1% false positive rate, requiring about 15 MB. This enables sub-millisecond local checks and reduces network lookups for safe URLs by 99%.
3. Key-value stores use Bloom filters with 10 bits per key (0.8% false positive rate). In random read workloads with many negative lookups, this improves read throughput by 1.5 to 2x by eliminating unnecessary disk I/O.