What Are Bloom Filters and How Do They Work?
Structure and Operations
The structure consists of a fixed size bit array of m bits (all initialized to 0) and k independent hash functions. When inserting an element, you apply all k hash functions to compute k array positions and set those bits to 1. To check if an element exists, you hash it with the same k functions and check if all k corresponding bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element might be present, but you cannot be certain due to hash collisions from other elements. This asymmetry makes Bloom filters perfect for scenarios where checking a negative is expensive (like a disk seek or network call) but occasionally treating an absent item as present is acceptable.
Space Efficiency
Standard Bloom filters do not support deletions because clearing a bit might affect other elements that also hashed to that position. With proper sizing, you can achieve approximately 1% false positive probability at roughly 9 to 10 bits per element, regardless of how large your actual keys are. A 100 million key set at 1% false positive rate requires only about 120 MB, far smaller than storing actual keys or using a hash set which might need 8 to 16 GB for the same key count.
Mathematical Foundations
The false positive rate P approximates to (1 - (1 - 1/m)^(k*n))^k, where n is the number of inserted elements. For a target false positive rate P and expected element count n, optimal bit array size is m = -(n * ln(P)) / (ln(2))^2. The optimal number of hash functions is k = (m/n) * ln(2). At 1% false positive rate, you need about 9.6 bits per element and approximately 7 hash functions. Production systems must monitor actual n versus design n because exceeding capacity causes false positive rate to degrade from 1% to 20% or higher if capacity is exceeded by 5x.