Bloom Filters in LSM Storage Engines: Read Path Optimization
Bloom filters are a foundational component of Log-Structured Merge (LSM) tree storage engines such as Google Bigtable, Apache Cassandra, Apache HBase, and RocksDB. LSM trees organize data into immutable Sorted String Table (SSTable) files on disk, with writes buffered in an in-memory memtable and periodic compaction merging files. A point lookup for a key may need to check multiple SSTables sequentially until the key is found or all candidates are exhausted. Without Bloom filters, each SSTable check requires a disk seek or, at minimum, a block read, which costs 5 to 10 milliseconds on spinning disks or 100 to 500 microseconds on solid-state drives (SSDs). For keys not present in the database (negative lookups), every SSTable must be checked, incurring multiple expensive I/O operations per query.
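A minimal sketch of the data structure itself may make the "definitely absent / maybe present" contract concrete. The Python below is a simplified, hypothetical filter using SHA-256 with double hashing; production engines use faster non-cryptographic hashes (MurmurHash, xxHash) and cache-friendly blocked layouts, so treat this as an illustration rather than any engine's implementation.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array probed at k hash positions."""

    def __init__(self, num_keys: int, false_positive_rate: float = 0.01):
        num_keys = max(1, num_keys)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, int(-num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round((self.num_bits / num_keys) * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k probe positions from two 64-bit base hashes.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means maybe present (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

Because SSTables are immutable, an engine builds a filter like this once while the file is written and never updates it afterward; it is simply dropped when the SSTable is deleted during compaction.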
Each SSTable stores a Bloom filter for its key set in memory or in a memory-mapped file, sized for approximately a 1% false positive rate (typically 9 to 10 bits per key). During a read, the engine checks the memtable first, then, for each candidate SSTable, checks its Bloom filter. If the filter says the key is definitely absent (returns negative), the engine skips that SSTable's disk access entirely. At a 1% false positive rate, 99% of negative lookups per SSTable avoid disk I/O. In a read path consulting 3 to 5 SSTables, Bloom filters can eliminate roughly 3 to 5 disk seeks per miss, transforming multi-millisecond tail latencies into sub-millisecond responses. This optimization is critical for read-heavy workloads where negative lookups dominate, such as social graph queries checking non-existent relationships or cache miss scenarios.
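To show where the filter sits on the read path, here is a hedged sketch of a point lookup over toy SSTables, reusing the BloomFilter class above. The SSTable class, its read_from_disk method, and point_lookup are illustrative names invented for this sketch, not any engine's actual API.

```python
class SSTable:
    """Toy stand-in for an immutable on-disk sorted file.

    The Bloom filter lives in memory; the key/value data would live on disk
    (modeled here as a dict so the sketch is runnable).
    """

    def __init__(self, entries: dict):
        self.data = entries                      # pretend access requires a block read
        self.bloom = BloomFilter(num_keys=len(entries))
        for key in entries:
            self.bloom.add(key)

    def read_from_disk(self, key: bytes):
        # In a real engine this is the expensive seek / block read we want to avoid.
        return self.data.get(key)


def point_lookup(key: bytes, memtable: dict, sstables: list):
    # 1. Newest data first: check the in-memory memtable.
    if key in memtable:
        return memtable[key]
    # 2. Walk SSTables newest to oldest, consulting each Bloom filter before touching disk.
    for sst in sstables:
        if not sst.bloom.might_contain(key):
            continue                             # definitely absent: skip the I/O entirely
        value = sst.read_from_disk(key)          # maybe present: pay for the read
        if value is not None:
            return value                         # actually present in this SSTable
    return None


# A key absent from every SSTable normally triggers zero disk reads.
tables = [SSTable({b"user:1": b"alice"}), SSTable({b"user:2": b"bob"})]
print(point_lookup(b"user:999", memtable={}, sstables=tables))   # None
```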
The trade-off is memory and CPU cost. A node storing 10 SSTables with 100 million keys each requires approximately 1.2 GB of memory for Bloom filters at a 1% false positive rate. The CPU cost is approximately 7 hash computations per lookup per SSTable checked. Compaction policy directly affects Bloom filter efficiency: size-tiered compaction can create many overlapping SSTables, increasing the number of filters checked per lookup and shifting cost from I/O to CPU. Leveled compaction maintains fewer overlapping files, reducing filter checks but requiring more frequent compaction I/O. Operators must balance compaction write amplification against read-path filter overhead. At very high query rates (hundreds of thousands of queries per second), the CPU cost of checking many filters can become significant, necessitating blocked Bloom filters or caching of filter results.
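The memory figures above follow from the standard sizing formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A quick check, assuming the 100-million-key SSTables and 1% target described here:

```python
import math

def bloom_sizing(num_keys: int, fp_rate: float):
    """Textbook Bloom filter sizing for a target false positive rate."""
    bits_per_key = -math.log(fp_rate) / (math.log(2) ** 2)   # ~9.6 bits per key at 1%
    num_hashes = round(bits_per_key * math.log(2))           # ~7 hash functions at 1%
    total_bytes = num_keys * bits_per_key / 8
    return bits_per_key, num_hashes, total_bytes

bits_per_key, k, per_sstable = bloom_sizing(100_000_000, 0.01)
print(f"{bits_per_key:.2f} bits/key, {k} hashes")            # 9.59 bits/key, 7 hashes
print(f"one SSTable:  {per_sstable / 1e6:.0f} MB")           # ~120 MB for 100M keys
print(f"ten SSTables: {10 * per_sstable / 1e9:.2f} GB")      # ~1.20 GB per node
```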
💡 Key Takeaways
• LSM engines like Bigtable, HBase, Cassandra, and RocksDB store one Bloom filter per SSTable in memory, sized for approximately a 1% false positive rate (9 to 10 bits per key). For 100 million keys per SSTable, this requires approximately 120 MB of memory per file, scaling to 1.2 GB for 10 SSTables per node.
• Point lookups check the Bloom filter before accessing each SSTable. At a 1% false positive rate, 99% of negative lookups per SSTable avoid disk I/O. In a read path checking 3 to 5 SSTables, filters eliminate 2.97 to 4.95 disk accesses per miss, reducing tail latency from potentially 15 to 25 milliseconds (3 to 5 seeks at 5 ms each) to under 1 millisecond (arithmetic worked through in the sketch after this list).
• Compaction strategy directly impacts Bloom filter overhead. Size-tiered compaction creates many overlapping SSTables, increasing the number of filters checked per query and the CPU cost. Leveled compaction maintains non-overlapping levels, reducing filter checks but increasing compaction write amplification by 10 to 30 times.
• Bloom filters provide negligible benefit for range scans because scans must read multiple sequential keys from disk regardless. Prefix Bloom filters or block-level filters can help range queries by filtering at block granularity, but they add complexity and are workload-dependent.
• CPU cost scales with the number of SSTables checked. At 7 hash operations per filter, checking 5 SSTables costs 35 hash computations per lookup. For workloads exceeding 100,000 queries per second per core, blocked Bloom filters or filter result caching become necessary to prevent CPU from becoming the bottleneck.
• False positive handling must be efficient. When a Bloom filter returns maybe present (1% of negatives), the engine performs a disk read that finds nothing. Systems should track false positive rates per SSTable via sampling to detect filters degraded by capacity overshoot or poor hashing, triggering proactive filter rebuilds during compaction.
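A quick check of the per-miss arithmetic quoted in the takeaways, using the illustrative 5 ms seek time, 1% false positive rate, and 7 hashes per filter from the bullets above:

```python
# Expected disk accesses for one negative lookup, with and without Bloom filters.
seek_ms = 5.0                  # spinning-disk seek time used in the takeaways
fp_rate = 0.01                 # 1% false positive rate
hashes_per_filter = 7

for sstables_checked in (3, 5):
    seeks_without = sstables_checked                    # every SSTable is read from disk
    seeks_with = sstables_checked * fp_rate             # only false positives reach disk
    hash_ops = hashes_per_filter * sstables_checked     # CPU cost of the filter checks
    print(f"{sstables_checked} SSTables: {seeks_without - seeks_with:.2f} seeks avoided, "
          f"{seeks_without * seek_ms:.0f} ms -> {seeks_with * seek_ms:.2f} ms, "
          f"{hash_ops} hash ops")
# 3 SSTables: 2.97 seeks avoided, 15 ms -> 0.15 ms, 21 hash ops
# 5 SSTables: 4.95 seeks avoided, 25 ms -> 0.25 ms, 35 hash ops
```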
📌 Examples
Cassandra allocates Bloom filter memory per SSTable in proportion to key count, targeting a 1% false positive rate. A node with 10 SSTables containing 100 million keys each uses approximately 1.2 GB for Bloom filters. This configuration reduces disk touches for negative lookups from 10 potential seeks (50 to 100 milliseconds total on spinning disks) to approximately 0.1 seeks on average, improving tail latency dramatically.
Google Bigtable pioneered per-SSTable Bloom filters for its tablet servers. With spinning disks (5 to 10 ms seek time), a read path checking 5 SSTables without filters would incur 25 to 50 milliseconds of seek time for a miss. Bloom filters at a 1% false positive rate reduce this to approximately 0.25 to 0.50 milliseconds of wasted seeks plus sub-millisecond filter checks, cutting tail latency by 50 to 100 times.
RocksDB with leveled compaction and Bloom filters at 10 bits per key handles 100,000 point-lookup queries per second with 70% negatives. Filters eliminate approximately 69,300 disk reads per second (99% of 70,000 negatives), while 700 false positives per second still trigger disk reads. The memory cost is approximately 10 bits times 1 billion keys, or 1.25 GB, and the CPU cost is approximately 7 hash operations per filter check times 100,000 queries per second, around 700,000 hash operations per second, easily sustained on modern cores.
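As a rough sanity check, the arithmetic in this RocksDB example can be reproduced directly from its stated workload parameters (all values below come from the example, not from measurement, and the CPU figure assumes roughly one filter consulted per lookup):

```python
# Back-of-the-envelope check of the RocksDB example's numbers.
total_keys = 1_000_000_000       # 1 billion keys across the tree
bits_per_key = 10                # filter sizing from the example (~1% FPR)
qps = 100_000                    # point lookups per second
negative_fraction = 0.70         # 70% of lookups target absent keys
fp_rate = 0.01
hashes_per_check = 7             # assuming ~1 filter consulted per lookup

filter_memory_gb = total_keys * bits_per_key / 8 / 1e9
negatives = qps * negative_fraction
print(f"filter memory:        {filter_memory_gb:.2f} GB")            # 1.25 GB
print(f"disk reads avoided:   {negatives * (1 - fp_rate):,.0f}/s")   # 69,300/s
print(f"false-positive reads: {negatives * fp_rate:,.0f}/s")         # 700/s
print(f"hash operations:      {qps * hashes_per_check:,.0f}/s")      # 700,000/s
```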