
Failure Modes in Data Lakes and Lakehouses

Data lakes and lakehouses face distinct failure modes at scale that can cripple performance, governance, and reliability if not anticipated.

The small files problem occurs when streaming or micro-batch writes create thousands of tiny objects (often under 10 MB each). The effects include long query planning times as engines list metadata for each file, high per-request overhead from object storage APIs, and poor scan throughput, since parallelism is limited by file count rather than data size. The symptom is table scans that should take seconds stretching to minutes. Mitigations include targeting file sizes of 256 MB to 1 GB through write buffering, scheduling compaction to merge small files in hot partitions, and tuning commit intervals (1 to 5 minutes) to balance freshness against file proliferation.

Concurrent writer conflicts arise under optimistic concurrency control when multiple writers attempt to commit overlapping changes to the same partitions. The lakehouse validates that no conflicting concurrent modifications occurred before publishing a new snapshot; under high contention, conflicts trigger retries, causing write amplification and potential reader or writer stalls. This manifests as spiky commit latencies, jumping from 2 seconds to 30 seconds during peak traffic. Mitigations include reducing commit granularity to partition-level isolation, implementing exponential backoff and retry policies, and serializing commits for known hot-spot partitions.

Schema drift and evolution pitfalls occur when incompatible changes break downstream consumers. Type narrowing (changing a long to an int), dropping columns, or renaming fields without aliasing causes job failures, and field-order changes in columnar formats like Parquet corrupt reads if consumers assume position-based schemas instead of name-based resolution. In a lakehouse serving hundreds of downstream jobs, one breaking schema change can cascade failures across the entire dependency graph within minutes. Mitigations include enforcing additive-only schema changes by default, using schema registries with compatibility checks, adopting name-based field resolution, and implementing soft deprecation periods (30 to 90 days) with dual-field support before hard removals.
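As a concrete illustration of the compaction mitigation, the sketch below rewrites one hot partition of a plain Parquet table into files near the 256 MB to 1 GB target. The table path, partition name, and partition_bytes input are hypothetical, and managed table formats such as Delta Lake and Apache Iceberg ship their own compaction commands; this only shows the underlying read-coalesce-rewrite idea.

```python
# Minimal compaction sketch in PySpark, assuming a date-partitioned Parquet table
# at a hypothetical path. Delta Lake / Iceberg provide built-in compaction; this
# illustrates the idea of merging many small files into a few large ones.
from pyspark.sql import SparkSession

TARGET_FILE_BYTES = 512 * 1024 * 1024  # aim near the middle of the 256 MB-1 GB band

spark = SparkSession.builder.appName("compact-hot-partition").getOrCreate()

def compact_partition(table_path: str, partition: str, partition_bytes: int) -> None:
    """Rewrite one partition into roughly partition_bytes / TARGET_FILE_BYTES files.

    partition_bytes is the current on-disk size of the partition, obtained from
    your catalog or object-store listing (not shown here).
    """
    num_files = max(1, -(-partition_bytes // TARGET_FILE_BYTES))  # ceiling division
    src = f"{table_path}/{partition}"

    df = spark.read.parquet(src)

    # coalesce() avoids a full shuffle; use repartition() if files must be evenly sized.
    (df.coalesce(num_files)
       .write
       .mode("overwrite")
       .parquet(src + "_compacted"))
    # Swapping _compacted in for the original directory atomically is left to the
    # table format / catalog; overwriting in place is unsafe with concurrent readers.

# Example: nightly job over yesterday's hot partition (hypothetical values)
compact_partition("s3://lake/events", "dt=2024-06-01", partition_bytes=40 * 1024**3)
```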
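For the concurrent-writer case, the retry policy can be as simple as wrapping the commit in exponential backoff with jitter. The commit_fn callable and CommitConflictError below are hypothetical stand-ins for whatever conflict exception your table format raises on overlapping snapshots; this is a sketch of the backoff pattern, not any engine's API.

```python
# Sketch of an exponential-backoff retry loop around an optimistic-concurrency commit.
# Jitter keeps many conflicting writers from retrying in lockstep (a retry storm).
import random
import time

class CommitConflictError(Exception):
    """Raised when another writer published an overlapping snapshot first (hypothetical)."""

def commit_with_backoff(commit_fn, max_attempts=5, base_delay_s=0.5, max_delay_s=30.0):
    """Retry an optimistic commit, doubling the delay (with jitter) on each conflict."""
    for attempt in range(max_attempts):
        try:
            return commit_fn()  # validates no conflicting changes, then publishes the snapshot
        except CommitConflictError:
            if attempt == max_attempts - 1:
                raise  # surface the conflict after the final attempt
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # add jitter

# Usage (hypothetical): commit_with_backoff(lambda: table.commit(added_files, removed_files))
```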
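For schema evolution, the additive-only policy can be enforced with a compatibility check before a change lands. The sketch below compares two schemas represented as simple name-to-type maps; the SAFE_WIDENINGS table and the example schemas are assumptions, and a real deployment would rely on a schema registry's compatibility rules plus name-based field resolution in the engine.

```python
# Minimal sketch of an additive-only compatibility check, run in CI before a schema change.
# Type promotions considered safe (widening only); anything else is treated as a break.
SAFE_WIDENINGS = {("int", "long"), ("float", "double"), ("int", "double")}

def check_compatible(old_schema: dict, new_schema: dict) -> list:
    """Return a list of violations; an empty list means the change is additive/compatible."""
    violations = []
    for name, old_type in old_schema.items():
        if name not in new_schema:
            violations.append(f"dropped or renamed column: {name}")
        else:
            new_type = new_schema[name]
            if new_type != old_type and (old_type, new_type) not in SAFE_WIDENINGS:
                violations.append(f"incompatible type change on {name}: {old_type} -> {new_type}")
    # Columns present only in new_schema are allowed: additions are backward compatible.
    return violations

# Example: narrowing user_id from long to int and dropping email are both flagged.
old = {"user_id": "long", "email": "string", "ts": "long"}
new = {"user_id": "int", "ts": "long", "country": "string"}
print(check_compatible(old, new))
# ['incompatible type change on user_id: long -> int', 'dropped or renamed column: email']
```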
💡 Key Takeaways
Small files problem: streaming writes create thousands of tiny objects (under 10 MB), causing minute-long planning times and poor scan throughput; mitigate with 256 MB to 1 GB target file sizes and compaction
Concurrent writer conflicts: optimistic concurrency fails commits under high contention, causing retries and 2-second to 30-second latency spikes; mitigate with partition-level isolation and backoff policies
Schema evolution pitfalls: type narrowing, column drops, or field-order changes break downstream jobs, cascading failures across hundreds of consumers within minutes
Long-running readers blocking cleanup: readers pin old snapshots, preventing file vacuuming and growing storage; mitigate with maximum retention windows (7 to 30 days) and reader timeouts
Data swamp without governance: undocumented datasets, conflicting schemas, duplicate tables, and low trust; symptoms include runaway storage costs and difficulty finding the correct data
📌 Examples
Streaming lakehouse ingesting 10K events per second with 1-second micro-batches creates 86,400 files per day per partition; query planning time jumps from 5 seconds to 3 minutes. Resolved with 5-minute batching and nightly compaction
High-traffic table with 20 concurrent writers experiences a 40% commit failure rate during peak; retry storms amplify write latency to 30 seconds. Resolved with partition-level write serialization and exponential backoff
Schema change renames user_id to userId without aliasing and breaks 200 downstream Spark jobs within 10 minutes, requiring an emergency rollback and a 2-week migration with dual-field support