Big Data Systems • Columnar Storage & Compression • Medium • ⏱️ ~3 min
What Are Vectorized Execution and Late Materialization in Columnar Engines?
Traditional row-oriented query engines process one tuple at a time through an iterator model, paying high per-row interpretation overhead. Columnar engines instead operate on batches of 1,024 to 16,384 values at once in a vectorized execution model. Filters, projections, and aggregations process entire columnar arrays in tight loops that Single Instruction Multiple Data (SIMD) instructions can accelerate, executing the same operation on 4 to 16 values per Central Processing Unit (CPU) cycle. This amortizes function-call overhead and improves instruction-cache locality.
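The contrast between the two models can be sketched in Python with NumPy, whose array operations dispatch to SIMD-capable kernels under the hood. The batch size and column values here are illustrative, not from any particular engine:

```python
import numpy as np

# Hypothetical batch size; real engines use 1,024 to 16,384 values per batch.
BATCH_SIZE = 4096

rng = np.random.default_rng(seed=0)
prices = rng.integers(0, 1000, size=BATCH_SIZE)

# Row-at-a-time model: one interpreted branch per tuple.
def filter_row_at_a_time(values, threshold):
    out = []
    for v in values:              # interpretation overhead paid per row
        if v < threshold:
            out.append(v)
    return out

# Vectorized model: one comparison over the whole batch; NumPy runs a
# tight compiled loop that SIMD instructions can accelerate.
def filter_vectorized(values, threshold):
    return values[values < threshold]

# Both produce the same result; the vectorized version amortizes the
# per-call overhead across the entire batch.
assert filter_row_at_a_time(prices, 500) == filter_vectorized(prices, 500).tolist()
```

The vectorized version pays the interpreter's dispatch cost once per batch rather than once per row, which is the core of the speedup even before SIMD is considered.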
Vectorized operators often work directly on encoded or compressed data. Filtering a dictionary-encoded column compares small integer ids rather than full strings, and aggregations on Run-Length Encoded (RLE) data can skip over runs without decompressing every value. ClickHouse achieves billions of rows processed per second per node by combining vectorized execution with lightweight LZ4 compression and efficient encodings, keeping data compact in CPU caches.
Late materialization defers reconstructing full rows until absolutely necessary. A query filtering on column A and projecting columns B and C will first evaluate the filter entirely on column A, producing a bitmap or list of qualifying row identifiers. Only then does it fetch columns B and C for those specific rows. This avoids reading and stitching together columns for rows that will be filtered out. For highly selective queries (filtering billions of rows down to thousands), late materialization dramatically reduces memory bandwidth and cache pollution.
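The two-phase pattern can be sketched with NumPy arrays standing in for column chunks. The column names and values are made up for illustration:

```python
import numpy as np

# Toy columnar table: each column is a separate array (standing in for
# separate column files or chunks).
col_a = np.array([5, 87, 3, 91, 42, 99, 7, 95])            # filter column
col_b = np.array([10, 20, 30, 40, 50, 60, 70, 80])          # payload
col_c = np.array([100, 200, 300, 400, 500, 600, 700, 800])  # payload

# Phase 1: evaluate the filter using column A alone, producing a
# selection vector of qualifying row ids.
qualifying = np.flatnonzero(col_a > 90)

# Phase 2: materialize rows late -- fetch B and C only for survivors.
rows = list(zip(col_b[qualifying], col_c[qualifying]))
```

Columns B and C are read only at the qualifying positions; for a highly selective filter, the engine never pays the bandwidth cost of the rows it was going to discard anyway.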
The trade-off appears when queries need many columns or have low selectivity. Reconstructing wide rows from many column chunks introduces random access patterns and increases CPU overhead from stitching. Point lookups requiring full rows can underperform row stores because fetching a single record requires touching multiple column files. Production columnar systems optimize for analytical workloads scanning many rows with few columns and high filter selectivity, accepting degraded performance on transactional access patterns.
💡 Key Takeaways
• Vectorized execution processes batches of 1,024 to 16,384 values at once, amortizing overhead and enabling SIMD to execute operations on 4 to 16 values per CPU cycle
• Operators work on encoded data directly: filtering dictionary ids instead of strings, aggregating RLE data without full decompression, keeping data compact in CPU caches
• Late materialization evaluates filters on minimal columns first, produces a bitmap of qualifying rows, then fetches additional columns only for survivors, dramatically reducing memory bandwidth
• ClickHouse achieves billions of rows processed per second per node through vectorized execution combined with LZ4 compression and efficient encodings optimized for cache locality
• Trade-off: wide SELECT queries or point lookups requiring full rows suffer from column-stitching overhead and random access patterns, underperforming row stores for transactional workloads
📌 Examples
Apache Pinot and Druid: vectorized execution on dictionary-encoded dimensions and pre-aggregated star-trees serve user-facing analytics with p95 latencies under 100 milliseconds while scanning millions to billions of rows
ClickHouse MergeTree: vectorized scans with SIMD filters process simple aggregations at 10 to 100 gigabytes per second aggregate throughput, with many OLAP queries returning in tens to hundreds of milliseconds on hot data
Late materialization in Parquet readers: filtering on a timestamp column produces qualifying row ids, then payload columns are fetched only for matching rows, avoiding gigabytes of unnecessary reads when the filter is highly selective (passing 0.01% to 1% of rows)