Real-time Analytics & OLAP • Approximate Query Processing
Sampling vs Sketch Based Techniques
Two Families of Approximation:
AQP splits into two main approaches: sampling based and sketch based techniques. Understanding which to use and when is critical for system design interviews.
Sampling Based AQP:
Sampling maintains subsets of your data at different rates. You might create a 1% uniform sample, where every row has a 1 in 100 chance of being included. The system runs ordinary SQL queries on these smaller tables and scales the results appropriately. For example, a COUNT on a 1% sample gets multiplied by 100.
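The scale-up step can be sketched in a few lines of Python. This is a toy illustration with a synthetic one-column table; the 30% match rate and the seed are invented for the demo:

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical full table: 1,000,000 event rows, ~30% from "US"
events = ["US" if random.random() < 0.3 else "DE" for _ in range(1_000_000)]

# 1% uniform sample: every row has a 1 in 100 chance of inclusion
RATE = 0.01
sample = [country for country in events if country and random.random() < RATE]

# Run the aggregate on the sample, then scale by 1 / RATE
sample_count = sum(1 for country in sample if country == "US")
estimate = sample_count / RATE

exact = events.count("US")
print(exact, round(estimate))  # the estimate lands within a few percent of exact
```

The scan touches roughly 10,000 rows instead of 1,000,000; the price is the statistical variance mentioned above, which shrinks as the sample rate grows.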
The advantage is generality. Any SQL query that works on the full table works on the sample. Joins work. Filters work. You just accept some statistical variance in the results. Companies typically maintain multiple sample rates, such as 0.1%, 1%, and 10% samples of the same base table; the query optimizer then picks the smallest sample that meets the accuracy requirements.
The disadvantage is storage overhead. A 1% sample of a 5 PB table is still 50 TB. You also need batch jobs to keep samples fresh, which adds operational complexity.
Sketch Based AQP:
Sketches are specialized probabilistic data structures that track aggregate statistics in sublinear space. The most common is HyperLogLog (HLL), which estimates distinct counts using only 1 to 2 KB per group regardless of cardinality. A sketch tracking 100 million unique users takes the same space as a sketch tracking 100 unique users.
Sketches have one near-magical property for distributed systems: they are mergeable. You can compute sketches per partition, per hour, or per region, then merge them at query time with simple union operations. This makes them a natural fit for streaming systems where data arrives continuously.
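To make the fixed-size and merge properties concrete, here is a toy HyperLogLog in Python. It follows the standard register-and-rank scheme but omits the full bias corrections production libraries apply; with p=12 it uses 4,096 registers (about 4 KB) no matter how many items it sees:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: 2^p small registers (p=12 -> 4,096 registers, ~4 KB)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: low p bits choose a register, the rest give the "rank"
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)
        rest = h >> self.p
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of the underlying sets == elementwise max of registers
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

# Two partitions with disjoint users, sketched independently, merged at query time
west, east = HyperLogLog(), HyperLogLog()
for i in range(50_000):
    west.add(f"user-{i}")
for i in range(50_000, 100_000):
    east.add(f"user-{i}")

union = west.merge(east)
print(round(union.estimate()))  # close to 100,000, from two ~4 KB sketches
```

Because merging is just an elementwise max, it is associative and idempotent, which is exactly what lets per-hour or per-region sketches be combined in any order.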
The tradeoff is specificity. Each sketch type supports exactly one operation. HyperLogLog does distinct counts. Quantile sketches do percentiles. Count Min Sketch does frequency estimation. You cannot run arbitrary SQL on a sketch.
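A Count Min Sketch is just as compact but answers a different question, item frequency. Below is a minimal Python version for illustration; the width and depth parameters are arbitrary choices, not production-tuned values:

```python
import hashlib

class CountMinSketch:
    """d hash rows of w counters; estimates item frequency, never undercounts."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent hash per row, derived by prefixing a per-row seed
        for seed in range(self.depth):
            h = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for row, idx in zip(self.rows, self._indexes(item)):
            row[idx] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound
        return min(row[idx] for row, idx in zip(self.rows, self._indexes(item)))

cms = CountMinSketch()
cms.add("checkout", 10_000)          # one heavy hitter
for i in range(1_000):
    cms.add(f"page-{i}")             # light background traffic

print(cms.estimate("checkout"))      # at least 10,000; collisions only add
```

Note what you cannot do here: there is no way to ask this structure for a join, a distinct count, or a percentile. One sketch, one operation.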
Real World Hybrid Approach:
Production systems combine both. Google BigQuery offers approximate aggregation functions that transparently use HLL sketches for distinct counts, making query cost linear in the number of groups rather than the number of rows. For a query grouping by 1,000 values across billions of rows, this changes I/O from terabytes to megabytes.
Apache Druid and Pinot use sketches for known heavy queries like cardinality and quantiles, then fall back to sampling or exact computation for arbitrary aggregations. Uber runs approximate latency percentiles from sketch based structures updated in real time, while exploratory queries from data scientists run on sampled tables refreshed nightly.
Sampling: general purpose SQL, 50 TB for a 1% sample
vs
Sketches: specific operations, 2 KB per group
💡 Key Takeaways
✓Sampling maintains 0.1% to 10% subsets of data and runs regular SQL queries, requiring 50 TB storage for 1% of a 5 PB table
✓Sketches like HyperLogLog use 1 to 2 KB per group regardless of cardinality, reducing I/O from terabytes to megabytes for high cardinality aggregations
✓Sampling supports any SQL operation but has storage overhead, while sketches are space efficient but limited to specific operations like distinct counts or percentiles
✓Sketches are mergeable across partitions and time, making them ideal for streaming systems that update continuously
✓Production systems use hybrids: sketches for known heavy queries like cardinality, sampling for exploratory ad hoc queries
✓BigQuery approximate distinct counts scale with the number of groups, not the number of rows, because of sketch based aggregation, turning scans over billions of rows into megabytes of I/O
📌 Examples
1. Query: SELECT country, COUNT(DISTINCT user_id) FROM 10B events GROUP BY country with 200 countries. Sampling scans 100M rows (1% sample), taking 5 seconds. Sketches read 200 HLL structures (400 KB total), taking 100ms.
2. Apache Druid maintains theta sketches for unique user counts per dimension combination, merged at query time from hourly segments to answer dashboard queries in under 200ms.
3. Uber streaming pipelines update quantile sketches tracking API latency percentiles every second, queried by on call engineers during incidents for near real time diagnostics.