Batch Spark Cost
Spot instances reduce Spark costs by 60 to 80 percent. A 100-node Spark cluster processing 1 terabyte takes about 30 minutes, at roughly $50 on demand or $10 to $20 on spot. Incremental processing with Delta Lake reads only changed partitions, cutting compute by 10 to 100x for daily jobs where only 1 to 5 percent of the data changes. Auto-scaling clusters start small and grow based on shuffle data size, avoiding over-provisioning.
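The savings above can be checked with back-of-the-envelope arithmetic. This is a minimal sketch using only the figures quoted in this section; the $100/hour cluster rate is an assumption chosen so that a 30-minute run costs the quoted $50.

```python
def job_cost(cluster_hourly_rate: float, runtime_hours: float) -> float:
    """Cost of one batch run: cluster rate x wall-clock time."""
    return cluster_hourly_rate * runtime_hours

# 100-node cluster, 30-minute job: ~$50 on demand
# (assumed rate: $100/hour for the whole cluster).
on_demand = job_cost(cluster_hourly_rate=100.0, runtime_hours=0.5)

# Spot instances at a 60 to 80 percent discount.
spot_low = on_demand * (1 - 0.80)    # ~$10
spot_high = on_demand * (1 - 0.60)   # ~$20

# Incremental processing: a daily job that touches only the 1 to 5
# percent of partitions that changed does proportionally less compute.
incremental_best = on_demand * 0.01   # ~100x less work
incremental_worst = on_demand * 0.05  # ~20x less work

print(f"on-demand: ${on_demand:.0f}, spot: ${spot_low:.0f}-${spot_high:.0f}")
print(f"incremental: ${incremental_best:.2f}-${incremental_worst:.2f}")
```

The point of the sketch is that the two levers compound: spot pricing cuts the rate, incremental processing cuts the runtime.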
Streaming Flink Cost
Always-on infrastructure costs 2 to 5x more than equivalent batch capacity. A Flink cluster processing 10,000 events per second with 1 hour of state retention runs continuously at roughly $500 to $1,000 per month. Right-sizing task parallelism to match throughput prevents over-provisioning: if each task handles 1,000 events per second, 10,000 events per second needs 10 tasks plus 50 percent headroom, i.e. 15 tasks.
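The right-sizing rule above is simple enough to encode. A minimal sketch, assuming headroom is expressed as a fraction on top of the base parallelism:

```python
import math

def required_parallelism(peak_eps: float, per_task_eps: float,
                         headroom: float = 0.5) -> int:
    """Tasks needed to sustain peak_eps with the given headroom fraction."""
    return math.ceil(peak_eps / per_task_eps * (1 + headroom))

# The example from the text: 10,000 events/s, 1,000 events/s per task,
# 50 percent headroom -> 15 tasks.
print(required_parallelism(10_000, 1_000))  # 15
```

Using `ceil` matters at non-round throughputs: 12,500 events/s with the same settings needs 19 tasks, not 18.75.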
Materialization Strategy
Not all features need streaming. Analyze feature freshness sensitivity: if a feature shows no quality degradation with 1 hour of staleness, use batch instead of streaming and save roughly 80 percent of the cost. Reserve streaming for the 10 to 20 percent of features where freshness materially impacts model performance. Teams often move 30 or more features from streaming to batch after sensitivity analysis shows no measurable CTR difference.
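The sensitivity analysis reduces to a per-feature decision rule. A minimal sketch, where the feature names and staleness tolerances are hypothetical and the 1-hour batch interval comes from the text:

```python
def choose_materialization(freshness_tolerance_s: float,
                           batch_interval_s: float = 3600.0) -> str:
    """Pick the cheaper tier: batch if the feature tolerates staleness
    at least as long as the batch interval, streaming otherwise."""
    return "batch" if freshness_tolerance_s >= batch_interval_s else "streaming"

# Hypothetical features with tolerances from a sensitivity analysis.
features = {
    "user_7d_purchase_count": 6 * 3600,  # no quality loss at 6h staleness
    "session_click_count": 30,           # degrades within seconds
}
plan = {name: choose_materialization(tol) for name, tol in features.items()}
print(plan)
```

In practice the tolerance threshold comes from offline experiments (retrain or replay with artificially staled features and compare metrics), not from intuition.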
Capacity Planning
Size streaming clusters for peak plus headroom, not for the average. Black Friday traffic at 10x normal requires 10x streaming capacity or aggressive load shedding. Pre-scale before known peaks. Monitor backpressure and consumer lag to detect under-provisioning before it impacts freshness SLAs.
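The peak-versus-average gap can be quantified directly. A minimal sketch; the 20 percent headroom figure is an assumption (the text says "plus headroom" without a number), and a capacity unit stands for whatever you scale by, e.g. one task slot handling 1,000 events/s:

```python
import math

def capacity_units(avg_eps: float, peak_multiplier: float,
                   unit_eps: float, headroom: float = 0.2) -> int:
    """Units needed to serve avg_eps x peak_multiplier plus headroom."""
    return math.ceil(avg_eps * peak_multiplier * (1 + headroom) / unit_eps)

# Normal load 10,000 events/s; Black Friday at 10x; one unit = 1,000 events/s.
for_peak = capacity_units(10_000, 10, 1_000)     # sized for the peak
for_average = capacity_units(10_000, 1, 1_000)   # sized for the average
print(for_peak, for_average)  # 120 vs 12
```

Sizing for the average leaves a 10x shortfall exactly when traffic peaks, which is why the alternative is aggressive load shedding rather than hoping the backlog drains.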
✓ Exactly once requires coordinated checkpoints every 10 to 60 seconds plus idempotent or transactional sinks. Without both, failures cause duplicate or missing feature updates that degrade model accuracy.
✓ Checkpoint barriers injected into streams ensure consistent global snapshots: all operators checkpoint at the same logical stream position. With 500GB of state, a full checkpoint writes 130 MB per second to remote storage.
✓ Idempotent sinks such as key-value upserts (user_42 count = 10) produce the same result on replay. Non-idempotent sinks such as appends or increments (count += 1) break exactly once, causing duplicates on recovery.
✓ Exactly once adds 5 to 15 percent throughput overhead from checkpoint coordination and pauses. At least once is simpler and faster but requires downstream dedup or features that tolerate approximation.
✓ Incremental checkpoints write only changed state, reducing checkpoint time from 5 minutes to under 1 minute for 500GB of state with a 20 percent change rate per interval.
✓ Transactional sinks use two-phase commit: pre-commit during the checkpoint, commit after the checkpoint completes. With a Kafka transactional producer, for example, writes become visible only after the checkpoint barrier.
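The idempotency point in the checklist above can be demonstrated with a replay simulation: events after the last checkpoint are reprocessed on recovery, and only the upsert sink survives the replay unchanged. This is a minimal sketch with plain dicts standing in for the sink stores:

```python
def apply_upsert(store: dict, key: str, value: int) -> None:
    store[key] = value  # idempotent: replaying yields the same state

def apply_increment(store: dict, key: str, delta: int) -> None:
    store[key] = store.get(key, 0) + delta  # non-idempotent: replay duplicates

# The example from the text: user_42 count = 10.
events = [("user_42", 10)]

upsert_store, increment_store = {}, {}
for k, v in events * 2:  # processed once, then replayed after a failure
    apply_upsert(upsert_store, k, v)
    apply_increment(increment_store, k, v)

print(upsert_store)     # {'user_42': 10}  correct after replay
print(increment_store)  # {'user_42': 20}  count doubled by replay
```

This is why the same at-least-once delivery is harmless behind an upsert sink and corrupting behind an increment sink.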
1. Uber trip features: An exactly-once pipeline writes per-driver rolling features to Cassandra with upserts (idempotent). Checkpoints every 30 seconds. Recovery replays from the last checkpoint without duplicate counts. Overhead: 8 percent throughput vs at least once.
2. LinkedIn activity counts: An at-least-once pipeline for approximate engagement counts. Accepts 0.1 percent overcount from duplicates after failures. Simpler ops, 15 percent higher throughput, no checkpoint coordination overhead.
3. Netflix viewing aggregates: Exactly once with incremental checkpoints to S3. State: 400GB. Full checkpoint: 4 minutes. Incremental: 50 seconds (15 percent state change per interval). Recovery time: 2 minutes. Transactional writes to Delta Lake ensure atomic commits.
4. Airbnb booking features: A pipeline failure during peak caused 2x click counts under at-least-once mode, biasing click-through rate (CTR) model training by 5 percent. Switched to exactly once with idempotent Redis upserts. Overhead: 10 percent throughput, eliminated training data corruption.
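The incremental-checkpoint numbers quoted earlier (500GB of state, full checkpoint in about 5 minutes, 20 percent change per interval, incremental under 1 minute) are mutually consistent, which a small model shows. The write rate here is derived from those quoted figures, not measured:

```python
def checkpoint_seconds(state_gb: float, change_rate: float,
                       write_gbps: float, incremental: bool) -> float:
    """Time to write a checkpoint: full writes all state, incremental
    writes only the changed fraction."""
    written_gb = state_gb * (change_rate if incremental else 1.0)
    return written_gb / write_gbps

# Derived sustained write rate: 500 GB in 300 s -> ~1.67 GB/s.
rate_gbps = 500 / 300
full = checkpoint_seconds(500, 0.20, rate_gbps, incremental=False)  # 300 s
incr = checkpoint_seconds(500, 0.20, rate_gbps, incremental=True)   # 60 s
print(f"full: {full:.0f}s, incremental: {incr:.0f}s")
```

At the same write rate, the incremental checkpoint time scales linearly with the change rate, which is why state with low churn benefits most.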