Big Data Systems · Log Aggregation & Analysis · Medium · ⏱️ ~3 min

Full Text Inverted Index vs Label Based Chunk Storage

The indexing strategy is the primary lever controlling both performance and cost in log aggregation systems. Full text inverted indexes and label based chunk models represent fundamentally different trade offs between query flexibility and operational cost.

Full text inverted indexes work like a book index, mapping every word or token to the log entries containing it. This enables rich ad hoc searches, regular expressions, and forensic investigations across any field. The cost is steep: index size adds 1x to 3x overhead on top of compressed data, write amplification during indexing consumes significant CPU and memory, and skewed data creates hot shard problems. A 100 TB compressed dataset might require 100 to 300 TB of additional space for indexes. This approach shines for security hunting and root cause analysis where you cannot predict search patterns in advance.

Label based chunk storage takes the opposite approach. You index only selected low cardinality dimensions like service name, namespace, region, or severity level. Raw log entries are compressed into large chunks (1 to 100 MB) and stored in object storage. Queries filter by indexed labels first, then scan matching chunks. This reduces costs by 3x to 10x compared to full text indexing because you are not indexing every word. The trade off is that free text searches become expensive or impossible, and poorly chosen labels lead to scanning entire datasets.

In practice, hybrid approaches dominate at scale. You might use full text indexing for critical services or security logs where budget permits rich queries, while applying label based storage to high volume debug logs. Auxiliary techniques like bloom filters, precomputed facets, and sampling help bridge the gap, letting you answer common questions cheaply while preserving deep dive capability for incidents.
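The two query paths can be sketched side by side. This is a minimal, in-memory illustration (all names and the toy log entries are hypothetical, not any real system's API): the inverted index answers a token lookup directly, while the label based model narrows candidates cheaply via labels and then pays a scan cost over raw chunk contents.

```python
from collections import defaultdict

# Toy log entries (hypothetical data for illustration).
logs = [
    {"service": "api", "severity": "error", "msg": "timeout calling payments"},
    {"service": "api", "severity": "info",  "msg": "request served from cache"},
    {"service": "db",  "severity": "error", "msg": "replica lag exceeded threshold"},
]

# --- Full text inverted index: every token maps to the entries containing it.
inverted = defaultdict(set)
for i, entry in enumerate(logs):
    for token in entry["msg"].lower().split():
        inverted[token].add(i)

def full_text_search(token):
    # Direct lookup: any token in any message is queryable.
    return sorted(inverted.get(token.lower(), set()))

# --- Label based model: index only low cardinality labels; raw messages
# live in an opaque "chunk" that must be scanned for free text queries.
label_index = defaultdict(list)
chunk = []  # stands in for a compressed chunk in object storage
for i, entry in enumerate(logs):
    label_index[(entry["service"], entry["severity"])].append(i)
    chunk.append(entry["msg"])

def label_query(service, severity, substring):
    # Cheap step: narrow to matching entries via the label index.
    candidates = label_index.get((service, severity), [])
    # Expensive step: scan the raw chunk contents for the substring.
    return [i for i in candidates if substring in chunk[i]]
```

The asymmetry is the whole trade off: `inverted` grows with every distinct token (the 1x to 3x overhead), while `label_index` stays tiny but forces `label_query` to scan chunk bytes for anything the labels cannot answer.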
💡 Key Takeaways
Full text inverted indexes add 1x to 3x storage overhead and write amplification but enable regex and ad hoc searches across any field, ideal for security forensics where query patterns are unpredictable
Label based chunk storage indexes only selected dimensions (service, region, severity) and stores raw logs in compressed chunks, reducing costs by 3x to 10x but requiring disciplined label usage
Index write amplification in full text systems consumes significant CPU and memory as every token must be mapped, creating shard hotspots under skewed data distributions
Poor label selection in chunk systems forces wide scans across entire datasets, negating cost benefits; high cardinality labels like user_id or request_id break the model with memory explosions
Hybrid strategies dominate production: full text for critical services and security logs, label based for high volume debug and info logs, with auxiliary bloom filters and precomputed facets bridging the gap
📌 Examples
A 100 TB compressed log dataset with full text indexing requires 100 to 300 TB of additional space for inverted indexes, while a label based approach that indexes only service, namespace, and severity limits index overhead to roughly 10 to 30 TB on top of the compressed chunks
Netflix event pipeline uses label based approach with selected dimensions indexed, storing billions of events in compressed chunks in object storage, reserving full text for security and compliance workloads
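The storage arithmetic in the first example checks out with a quick back-of-the-envelope calculation (the 10 to 30 TB label index figure is taken as an assumption from the example above, not a measured value):

```python
# Back-of-the-envelope storage math for the 100 TB example above.
compressed_tb = 100

# Full text inverted index: 1x to 3x overhead on top of compressed data.
ft_index = (1 * compressed_tb, 3 * compressed_tb)
ft_total = (compressed_tb + ft_index[0], compressed_tb + ft_index[1])

# Label based: only a handful of low cardinality labels are indexed;
# assume the 10 to 30 TB overhead range quoted in the example.
lb_index = (10, 30)
lb_total = (compressed_tb + lb_index[0], compressed_tb + lb_index[1])

print(f"full text: {ft_index[0]}-{ft_index[1]} TB index, "
      f"{ft_total[0]}-{ft_total[1]} TB total")
print(f"label based: {lb_index[0]}-{lb_index[1]} TB index, "
      f"{lb_total[0]}-{lb_total[1]} TB total")
```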