A/B Testing & Experimentation: Ramp-up Strategies & Canary Analysis (Hard, ⏱️ ~3 min)

Implementation: Traffic Routing, Metric Collection, and Decision Engine

TRAFFIC ROUTING IMPLEMENTATION

Implement deterministic hash bucketing in your load balancer or application layer: bucket = hash(user_id) mod 10000, using a stable hash function (not a per-process salted one) so a user always lands in the same bucket. At a 5% canary, route buckets 0-499 to the new version. For stratified sampling, first assign users to segments (e.g., mobile in buckets 0-5999, desktop in 6000-9999), then apply the percentage within each segment. Store the assignment decision in a cookie or session to keep routing sticky across requests.
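A minimal sketch of this bucketing scheme in Python. The segment bucket ranges and the `route`/`stratified_route` helper names are illustrative assumptions, not a standard API; SHA-256 stands in for any stable hash.

```python
import hashlib

NUM_BUCKETS = 10_000

# Assumed segment split from the text: mobile gets buckets 0-5999,
# desktop gets 6000-9999.
SEGMENT_RANGES = {"mobile": (0, 6_000), "desktop": (6_000, 10_000)}

def bucket_for(user_id: str) -> int:
    """Deterministic bucket in [0, 10000). SHA-256 is stable across
    processes, unlike Python's built-in salted hash()."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def route(user_id: str, canary_pct: float) -> str:
    """At 5% canary, buckets 0-499 go to the new version."""
    threshold = int(canary_pct / 100 * NUM_BUCKETS)
    return "canary" if bucket_for(user_id) < threshold else "control"

def stratified_route(user_id: str, segment: str, canary_pct: float) -> str:
    """Apply the canary percentage within the user's segment range."""
    lo, hi = SEGMENT_RANGES[segment]
    b = lo + bucket_for(user_id) % (hi - lo)          # bucket inside segment
    return "canary" if b - lo < int(canary_pct / 100 * (hi - lo)) else "control"
```

In practice the routing decision is then written to the cookie or session, so subsequent requests skip the hash and stay sticky.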

METRIC COLLECTION PIPELINE

Tag every request with: version (control/canary), cohort bucket, user segment, timestamp. Aggregate at 1-5 minute granularity for fast feedback. Use 5-15 minute trailing windows for comparison. Stream events to a real-time aggregation system that computes running statistics: mean latency, error rate, CTR by cohort. Store raw events for post-hoc analysis.
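The aggregation step can be sketched as an in-memory trailing-window accumulator. This is only an illustration of the windowing logic, assuming a `CohortStats` class of my own naming; a production pipeline would stream events to a real-time aggregation system and keep raw events for post-hoc analysis, as described above.

```python
import time
from collections import defaultdict, deque

class CohortStats:
    """Per-(version, segment) trailing window of tagged request events.
    Sketch only: holds everything in memory and evicts lazily on read."""

    def __init__(self, window_s: int = 600):  # 10-min trailing window
        self.window_s = window_s
        self.events = defaultdict(deque)      # (version, segment) -> deque

    def record(self, version, segment, latency_ms, is_error, ts=None):
        ts = ts if ts is not None else time.time()
        self.events[(version, segment)].append((ts, latency_ms, is_error))

    def snapshot(self, version, segment, now=None):
        """Running stats over the trailing window: mean latency, error rate."""
        now = now if now is not None else time.time()
        q = self.events[(version, segment)]
        while q and q[0][0] < now - self.window_s:
            q.popleft()                       # evict events outside the window
        if not q:
            return {"n": 0}
        latencies = [e[1] for e in q]
        errors = sum(e[2] for e in q)
        return {"n": len(q),
                "mean_latency": sum(latencies) / len(latencies),
                "error_rate": errors / len(q)}
```

Comparing `snapshot("control", seg)` against `snapshot("canary", seg)` per segment gives the fast-feedback deltas the decision engine consumes.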

TWO-LAYER DECISION ENGINE

Layer 1 (automatic): Guardrail violations trigger immediate rollback. Error rate delta > 0.1% for 5 minutes, P99 > 300ms for 10 minutes, or feature null rate > 1%. No human intervention required.
Layer 2 (statistical): Product metrics require statistical tests with CUPED adjustment and multiple comparison correction. Compute confidence intervals and p-values. Alert humans for borderline cases.
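Layer 1 can be expressed as a small table of rules evaluated against per-minute metrics. The thresholds below come from the text; the sustain window for the null-rate rule is an assumption (the text gives no duration for it), and the `should_rollback` helper is an illustrative name.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    """One Layer-1 rule: breach if `metric` exceeds `threshold`
    for `sustain_min` consecutive minutes."""
    metric: str
    threshold: float
    sustain_min: int

GUARDRAILS = [
    Guardrail("error_rate_delta", 0.001, 5),    # >0.1% for 5 min
    Guardrail("p99_latency_ms", 300.0, 10),     # P99 >300ms for 10 min
    Guardrail("feature_null_rate", 0.01, 5),    # >1%; 5-min sustain assumed
]

def should_rollback(minutes: list[dict]) -> bool:
    """`minutes` is the per-minute metric series, oldest first.
    Returns True (automatic rollback, no human in the loop) if any
    guardrail is breached for its full sustain window."""
    for g in GUARDRAILS:
        recent = minutes[-g.sustain_min:]
        if len(recent) == g.sustain_min and all(
            m.get(g.metric, 0.0) > g.threshold for m in recent
        ):
            return True
    return False
```

Layer 2 is deliberately not automated here: statistical verdicts on product metrics go to a human queue rather than a rollback trigger.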

✅ Best Practice: Automate guardrail rollbacks but require human approval for product metric decisions. Machines catch crashes; humans judge trade-offs.

RAMP SCHEDULE WITH GATES

Typical schedule: 0.5% → 1% → 5% → 10% → 25% → 50% → 100%. Between each step, verify: latency delta within bounds, error rate stable, downstream service headroom sufficient, no data quality alerts. Log each gate decision for audit. Full ramp typically takes 24-48 hours with conservative gates.
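The gated ramp loop can be sketched as follows. `check_gates`, `set_traffic`, and the audit log are injected callables of my own devising so the sketch stays infrastructure-agnostic; a real controller would also wait between steps rather than advance immediately.

```python
RAMP_STEPS = [0.5, 1, 5, 10, 25, 50, 100]

def run_ramp(check_gates, set_traffic, audit_log):
    """Advance through the ramp schedule, logging every gate decision
    for audit and rolling back at the first failed gate.
    check_gates(pct) -> (ok: bool, reasons: list[str])."""
    for pct in RAMP_STEPS:
        set_traffic(pct)
        ok, reasons = check_gates(pct)               # latency, errors, headroom...
        audit_log.append({"pct": pct, "passed": ok, "reasons": reasons})
        if not ok:
            set_traffic(0)                           # roll back on a failed gate
            return False
    return True                                      # reached 100%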

💡 Key Takeaways
Routing: hash(user_id) mod 10000 → bucket; at 5%, route buckets 0-499 to canary; store in cookie for sticky assignment
Metrics: tag every request with version/cohort/segment, aggregate at 1-5 min granularity, use 5-15 min trailing windows
Two-layer decisions: Layer 1 auto-rollback on guardrails (error, latency, nulls), Layer 2 statistical tests for product metrics
Ramp schedule: 0.5% → 1% → 5% → 10% → 25% → 50% → 100% with gates verifying latency, errors, downstream capacity
📌 Interview Tips
1. Walk through the hash implementation: e.g., user 123456 hashes to bucket 6456 and stays in control until the ramp passes 64.56%.
2. Explain two-layer decisions: machines auto-rollback on guardrail breaches; humans approve product metric trade-offs.
3. Describe a typical ramp schedule with gates: 24-48 hours from 0.5% to 100%, with verification at each step.