Kappa Architecture Pattern

What is Kappa Architecture?

Definition
Kappa Architecture is a data processing pattern where ALL data flows through a single stream processing system, using an immutable event log as the source of truth for both real-time processing and historical reprocessing.
Imagine you build a recommendation engine that needs to respond to user clicks within seconds. Six months later, your data scientists improve the algorithm and need to recalculate recommendations for all users based on historical behavior. Traditional systems force you to build two separate pipelines: one for real-time data (stream processing) and one for historical data (batch processing). This doubles your code, doubles your bugs, and doubles your operational complexity.

The Core Problem
Modern products need two things that seem contradictory: low-latency decisions (under 2 seconds) AND the ability to recompute results when business logic changes. Classic batch Extract, Transform, Load (ETL) handles recomputation well but delivers results in hours or days. Lambda Architecture tried to solve this by running batch and stream layers in parallel, but you end up maintaining separate code paths for the same logic.

How Kappa Solves This
Kappa uses a single paradigm. Every piece of data is written as an event to an append-only log (like a Kafka topic with long retention). One stream processing engine handles everything. For real-time needs, it reads from the tail of the log, processing fresh events. For recomputation, the same code reads from the beginning of the log, replaying months of history at high speed. When you deploy new logic, you start a new version of your streaming job that replays the historical log while the old version continues serving. Once the new version catches up and passes validation, you switch traffic over. No separate batch code. No reconciliation between two systems.

A Concrete Example
An e-commerce site writes every user click, search, and purchase to a central event log. A streaming job consumes these events to build user profiles for recommendations, updating within 1 to 2 seconds. When the team ships a new recommendation model that needs different features, they deploy a new streaming job that replays 90 days of events from the log at 5 times real-time speed, building new profiles in parallel. After catching up, traffic switches to the new profiles. Same code path, same infrastructure.
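
Below is a minimal sketch of that single code path in Python, assuming a Kafka topic named user-events and the kafka-python client; the event fields and the profile-update logic are illustrative, not part of any particular product. The point is that live serving and historical replay run the exact same function and differ only in where the consumer starts reading.

```python
# Minimal sketch of Kappa's single code path (topic and group names are assumed).
import json
from kafka import KafkaConsumer  # pip install kafka-python

def update_user_profile(profiles: dict, event: dict) -> None:
    """Same business logic for live traffic and for historical replay."""
    user = event["user_id"]
    profile = profiles.setdefault(user, {"clicks": 0, "purchases": 0})
    if event["type"] == "click":
        profile["clicks"] += 1
    elif event["type"] == "purchase":
        profile["purchases"] += 1

def run_job(group_id: str, replay: bool) -> None:
    # A new group_id with auto_offset_reset="earliest" replays the whole log;
    # an existing group resumes near the tail and processes only fresh events.
    consumer = KafkaConsumer(
        "user-events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest" if replay else "latest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    profiles: dict = {}
    for record in consumer:          # blocks, yielding one event at a time
        update_user_profile(profiles, record.value)

# run_job("profiles-v1", replay=False)  # serve live traffic from the tail
# run_job("profiles-v2", replay=True)   # rebuild everything with new logic
```
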
💡 Key Takeaways
Single stream processing layer handles both real-time and historical data, eliminating the need for separate batch and streaming code paths
Immutable event log with long retention (typically 30 to 180 days) acts as the source of truth for all processing (see the topic-creation sketch after this list)
Reprocessing is done by replaying the event log from the beginning with the same streaming code, often at 3 to 5 times real-time throughput
Materialized views (derived data stores) are considered disposable and can be rebuilt by replaying the log whenever business logic changes
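
The retention takeaway above can be made concrete with a topic-creation sketch; the broker address, topic name, partition count, and the 90-day figure are assumptions chosen to match the running example, not recommendations.

```python
# Minimal sketch: create the event-log topic with long retention so it can
# serve as the replayable source of truth (all names and sizes are assumed).
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

ninety_days_ms = 90 * 24 * 60 * 60 * 1000  # 7,776,000,000 ms

admin.create_topics([
    NewTopic(
        name="user-events",
        num_partitions=12,            # illustrative sizing for a high event rate
        replication_factor=3,
        topic_configs={
            "retention.ms": str(ninety_days_ms),  # keep 90 days of history
            "cleanup.policy": "delete",           # time-based retention
        },
    )
])
```
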
📌 Examples
1. E-commerce platform writes 200,000 events/sec to a central log. Streaming job builds user profiles for recommendations within 1 to 2 seconds. When deploying a new model, replay 90 days of history at 5x speed to rebuild profiles, then switch traffic (see the catch-up sketch after these examples).
2. Fraud detection system consumes transaction events with p99 latency under 2 seconds from ingestion to score. When rules change, replay historical events to recompute risk scores without writing a separate batch job.
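
Both examples rely on the cut-over described earlier: the new job replays history while the old one keeps serving, and traffic only moves once the replay has caught up. The sketch below shows one way to check that catch-up with kafka-python; the group and topic names (profiles-v2, user-events) are hypothetical and carried over from the earlier sketches.

```python
# Minimal sketch: verify the replaying job has (nearly) reached the head of
# the log before switching traffic to its output. Names are assumptions.
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

BOOTSTRAP = "localhost:9092"
REPLAY_GROUP = "profiles-v2"   # the new job replaying from the beginning
TOPIC = "user-events"

def replay_caught_up(max_lag_events: int = 1000) -> bool:
    """True once the replay group is within max_lag_events of the log head."""
    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
    committed = admin.list_consumer_group_offsets(REPLAY_GROUP)
    partitions = [tp for tp in committed if tp.topic == TOPIC]
    if not partitions:
        return False  # replay job has not committed any progress yet

    # A throwaway consumer (no group) just to read the current end offsets.
    probe = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
    ends = probe.end_offsets(partitions)

    total_lag = sum(ends[tp] - committed[tp].offset for tp in partitions)
    return total_lag <= max_lag_events

# if replay_caught_up(): point the serving layer at the profiles-v2 output
```

Once the check passes and validation succeeds, the serving layer switches to the new job's output and the old version can be retired.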