How Pipeline Monitoring Works End to End
The Complete Flow:
Pipeline monitoring operates in layers, from raw event ingestion through to the final serving tables. Imagine a typical large-scale data platform: mobile and web clients send 500,000 events per second to an API gateway, which publishes to Kafka. A streaming layer (Flink or Spark Structured Streaming) processes these events, enriches them with user profile data, and writes to a real-time store. Simultaneously, batch orchestration (Airflow, Argo) runs 3,000 daily jobs that read the raw events, join them with OLTP database snapshots, and produce dimensional models in a warehouse.
Instrumentation at Each Layer:
At the ingestion edge, monitor Kafka consumer lag in both time and message count; Netflix, for example, alerts when lag exceeds 5 minutes or 1 million messages per partition. For streaming, track the gap between event time (when the event occurred) and processing time (when the pipeline handled it). A typical SLO: p99 end-to-end latency under 60 seconds at 200,000 events per second.
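To make the thresholds concrete, here is a minimal sketch of a per-partition lag check using the limits mentioned above (5 minutes or 1 million messages). The function and its inputs are hypothetical, not a specific Kafka client API; in practice the high watermark and committed offset come from broker and consumer-group metadata, and the event timestamps from the messages themselves.

```python
import time

# Thresholds taken from the text: alert when consumer lag exceeds
# 5 minutes (time lag) or 1 million messages per partition.
MAX_LAG_SECONDS = 5 * 60
MAX_LAG_MESSAGES = 1_000_000

def check_partition_lag(high_watermark: int, committed_offset: int,
                        newest_event_ts: float, last_committed_event_ts: float) -> list[str]:
    """Return alert strings for one partition; an empty list means healthy."""
    alerts = []
    message_lag = high_watermark - committed_offset          # messages behind
    time_lag = newest_event_ts - last_committed_event_ts     # seconds behind

    if message_lag > MAX_LAG_MESSAGES:
        alerts.append(f"message lag {message_lag:,} exceeds {MAX_LAG_MESSAGES:,}")
    if time_lag > MAX_LAG_SECONDS:
        alerts.append(f"time lag {time_lag:.0f}s exceeds {MAX_LAG_SECONDS}s")
    return alerts

# Example: a partition that is 1.2M messages and ~7 minutes behind fires both alerts.
print(check_partition_lag(
    high_watermark=10_200_000, committed_offset=9_000_000,
    newest_event_ts=time.time(), last_committed_event_ts=time.time() - 420,
))
```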
Batch orchestrators emit standardized metrics per DAG run:
pipeline_status, duration_seconds, input_row_count, and output_row_count. These flow into a time series database like Prometheus or Datadog. Within individual Spark jobs, instrument key stages: input rows read, output rows written, records dropped due to validation failures, and errors sent to dead letter queues.
Monitoring layers at a glance:
Ingestion Layer (Kafka lag, drop rate)
↓
Streaming Layer (processing latency, throughput)
↓
Batch Orchestration (job status, duration, row counts)
↓
Data Quality Checks (volume anomalies, freshness)
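As a sketch of how those per-run metrics might reach Prometheus, the snippet below pushes them to a Pushgateway with the prometheus_client library. The gateway address, label names, and job name are illustrative assumptions, not part of the original text; in an Airflow setup this would typically run from a success/failure callback.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def emit_run_metrics(dag_id: str, status: int, duration_s: float,
                     rows_in: int, rows_out: int,
                     gateway: str = "pushgateway:9091") -> None:
    """Publish the standardized per-DAG-run metrics named above."""
    registry = CollectorRegistry()
    labels = ["dag_id"]

    Gauge("pipeline_status", "1 = success, 0 = failure", labels,
          registry=registry).labels(dag_id=dag_id).set(status)
    Gauge("duration_seconds", "Wall-clock run time", labels,
          registry=registry).labels(dag_id=dag_id).set(duration_s)
    Gauge("input_row_count", "Rows read this run", labels,
          registry=registry).labels(dag_id=dag_id).set(rows_in)
    Gauge("output_row_count", "Rows written this run", labels,
          registry=registry).labels(dag_id=dag_id).set(rows_out)

    push_to_gateway(gateway, job="batch_pipeline", registry=registry)

# Example call from an orchestrator callback (values are illustrative):
# emit_run_metrics("daily_orders_dim", status=1, duration_s=1840,
#                  rows_in=10_000_000, rows_out=9_800_000)
```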
Data Quality Layer:
Modern observability systems add continuous data profiling on top of operational metrics. They compute statistical profiles for thousands of tables: row count trends, distributions of numeric columns, null fractions, and schema evolution. When daily orders volume drops 40 percent against a 14-day baseline, an anomaly alert triggers. Companies like Uber and Meta run internal systems that profile critical tables every 15 to 60 minutes and route alerts to the owning team's Slack channel.
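A rough sketch of that volume-anomaly rule: compare today's row count with a 14-day baseline and alert on a large relative drop. The 40 percent threshold matches the example above; the function name and the Slack routing are hypothetical, since the real systems at Uber and Meta are internal.

```python
from statistics import mean

DROP_THRESHOLD = 0.40  # 40% drop vs. baseline, as in the example above

def check_volume_anomaly(table: str, daily_counts: list[int], today_count: int) -> str | None:
    """Return an alert message if today's volume drops sharply vs. a 14-day baseline."""
    baseline = mean(daily_counts[-14:])          # 14-day rolling baseline
    drop = (baseline - today_count) / baseline   # relative drop vs. baseline
    if drop >= DROP_THRESHOLD:
        return (f"[{table}] volume anomaly: {today_count:,} rows today vs "
                f"{baseline:,.0f} baseline ({drop:.0%} drop)")
    return None

# daily_orders averages 2.5M rows; today only 1.5M arrive, so the alert fires.
alert = check_volume_anomaly("daily_orders", [2_500_000] * 14, 1_500_000)
if alert:
    print(alert)   # in production: post to the owning team's Slack channel
```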
Typical streaming SLO: p99 end-to-end latency under 60 seconds at 200,000 events per second of throughput.
💡 Key Takeaways
✓ Ingestion monitoring: Kafka consumer lag tracked in both time (seconds) and message count, with alerts when lag exceeds 5 minutes or 1 million messages per partition
✓ Streaming SLO example: p99 end-to-end latency under 60 seconds while processing 200,000 events per second
✓ Batch orchestration emits standard metrics: pipeline status, duration, input/output row counts, and error counts at each stage
✓ Data quality profiling: continuous monitoring of row count trends, null fractions, and schema changes across thousands of tables, profiled every 15 to 60 minutes
📌 Examples
1. Within a Spark job: log input_rows=10M, output_rows=9.8M, dropped_rows=200k (validation failures), errors_to_dlq=50 (parse failures); see the PySpark sketch after this list
2. Streaming latency calculation: an event occurs at T=0, reaches Kafka at T+2s, is processed by Flink at T+8s, and is written to the serving table at T+12s. End-to-end latency: 12 seconds
3. Anomaly detection: the daily_orders table averages 2.5M rows over 14 days. Today it has 1.5M rows (a 40% drop). An alert fires to the owning team's Slack channel
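As referenced in example 1, here is an illustrative PySpark sketch of counting rows at each stage and logging the deltas. The input path, column names, validation rule, and DLQ location are hypothetical assumptions; the point is the pattern of counting before and after each transformation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_ingest").getOrCreate()

# Hypothetical raw input location and schema.
raw = spark.read.json("s3://raw-bucket/orders/2024-01-01/")
input_rows = raw.count()

# Hypothetical validation rule: rows failing it are dropped and written to a dead letter queue.
valid = raw.filter(F.col("order_id").isNotNull() & (F.col("amount") > 0))
invalid = raw.subtract(valid)

output_rows = valid.count()
dropped_rows = input_rows - output_rows

invalid.write.mode("append").json("s3://dlq-bucket/orders/")            # dead letter queue
valid.write.mode("overwrite").parquet("s3://warehouse/orders/2024-01-01/")

# These counters feed the per-job metrics described earlier in the section.
print(f"input_rows={input_rows} output_rows={output_rows} dropped_rows={dropped_rows}")
```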