Confidence Intervals: Precision and Practical Interpretation
A confidence interval is a range around your point estimate that would contain the true parameter in a fixed fraction of repeated samples. A 95 percent confidence interval means the procedure covers the truth 95 percent of the time in the long run, not that there's a 95 percent probability the true value lies in your computed range. This is a subtle but critical distinction for interpreting results.
You compute an interval as the estimate plus or minus a margin of error, where the margin of error equals the critical value times the standard error. For large samples, where the central limit theorem makes the sampling distribution of the estimate approximately normal, use a z critical value (1.96 for 95 percent). For small samples where the variance is estimated from the data, use a t critical value with n minus 1 degrees of freedom. For proportions, the normal approximation works when both the expected number of successes (n times p) and the expected number of failures (n times 1 minus p) are at least about 10; otherwise use Wilson or Agresti-Coull intervals, which have better coverage when the proportion is near 0 or 1. For heavy-tailed metrics like watch time or revenue, bootstrap intervals are more robust but computationally expensive at scale.
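The three approaches in the paragraph above can be sketched with the Python standard library alone. This is a minimal illustration, not a production implementation: the function names (`z_critical`, `mean_interval`, `wilson_interval`, `bootstrap_mean_interval`) are hypothetical, and the bootstrap uses the simple percentile method with an arbitrary default of 2000 resamples.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def z_critical(confidence=0.95):
    """Two-sided standard normal critical value (about 1.96 for 95 percent)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def mean_interval(values, confidence=0.95):
    """Large-sample interval for a mean: estimate +/- z * standard error."""
    n = len(values)
    estimate = mean(values)
    standard_error = stdev(values) / math.sqrt(n)
    margin = z_critical(confidence) * standard_error
    return estimate - margin, estimate + margin


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a proportion; behaves well near 0 or 1."""
    z = z_critical(confidence)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def bootstrap_mean_interval(values, confidence=0.95, resamples=2000, seed=0):
    """Percentile bootstrap interval for a mean (assumed method, fixed seed)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(resamples))
    alpha = 1 - confidence
    lower = means[int((alpha / 2) * resamples)]
    upper = means[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lower, upper
```

Note the cost difference: the z and Wilson intervals need one pass over the data, while the bootstrap recomputes the statistic once per resample, which is why it gets expensive on large experiments.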
Confidence intervals carry far more information than p-values alone. They show both effect size and precision. For example, a new ranking model at Meta might show absolute CTR lift of 0.10 percentage points with 95 percent confidence interval of 0.06 to 0.14. This tells you the effect is statistically significant (interval excludes zero) and practically meaningful (lower bound of 0.06 still represents business value). Compare that to just knowing p less than 0.05, which says nothing about magnitude.
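To make the CTR example concrete, here is a hedged sketch of a normal-approximation interval for the difference between two click-through rates. The function name `lift_interval` and the click and impression counts are invented for illustration; the counts were picked so that the result lands near the 0.06 to 0.14 percentage point range quoted above, and they are not real data from any company.

```python
import math
from statistics import NormalDist


def lift_interval(clicks_a, views_a, clicks_b, views_b, confidence=0.95):
    """Interval for the absolute lift p_b - p_a using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    lift = p_b - p_a
    # Standard error of a difference of two independent proportions.
    standard_error = math.sqrt(
        p_a * (1 - p_a) / views_a + p_b * (1 - p_b) / views_b
    )
    return lift, lift - z * standard_error, lift + z * standard_error


# Hypothetical counts: control CTR of 2.0 percent, treatment CTR of 2.1 percent.
lift, lower, upper = lift_interval(20_000, 1_000_000, 21_000, 1_000_000)

# Report in percentage points; the interval excluding zero means significance.
print(f"lift={lift * 100:.2f} pp, CI=[{lower * 100:.2f}, {upper * 100:.2f}] pp")
print("significant at 95 percent:", lower > 0 or upper < 0)
```

The same three numbers answer both questions at once: whether zero is inside the interval (significance) and how small the effect could plausibly be (the lower bound).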
In production, companies pair intervals with business thresholds. Netflix might require the lower bound of the 95 percent interval for engagement to exceed zero AND the upper bound for p95 latency increase to stay below 10 milliseconds. This prevents shipping changes that are statistically significant but harm user experience or have uncertain business value.
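A decision rule of this kind can be written as a small function. The sketch below is an assumption about how such a gate might look, not any company's actual launch criteria; `Interval`, `should_ship`, and the default thresholds are hypothetical names and values taken from the example in the text.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    lower: float
    upper: float


def should_ship(
    engagement_lift: Interval,
    latency_increase_ms: Interval,
    min_engagement_lift: float = 0.0,
    max_latency_increase_ms: float = 10.0,
) -> bool:
    """Ship only if the benefit is credible and the guardrail is safe.

    The benefit check uses the LOWER bound (worst plausible gain);
    the guardrail check uses the UPPER bound (worst plausible harm).
    """
    engagement_ok = engagement_lift.lower > min_engagement_lift
    latency_ok = latency_increase_ms.upper < max_latency_increase_ms
    return engagement_ok and latency_ok


# Engagement is clearly positive, but latency could plausibly rise by 15 ms.
print(should_ship(Interval(1.8, 3.2), Interval(8.0, 15.0)))  # False
```

Checking the pessimistic end of each interval is the design choice that matters here: a point estimate under the threshold is not enough if the interval still reaches past it.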
💡 Key Takeaways
• A 95 percent confidence interval means the procedure captures the true value 95 percent of the time across repeated samples, not a 95 percent probability for this specific interval
• Intervals show both significance and magnitude: a CTR lift of 0.10 percentage points with an interval of 0.06 to 0.14 tells you the effect is real and large enough to matter
• Use a z critical value (1.96 for 95 percent) for large samples, and a t critical value with n minus 1 degrees of freedom for small samples
• For proportions near 0 or 1, Wilson or Agresti-Coull intervals have better coverage than the normal approximation
• Bootstrap intervals handle heavy-tailed metrics like revenue or watch time but cost more computation at petabyte scale
• Production decisions pair intervals with thresholds: ship if the lower bound for engagement exceeds zero AND the upper bound for latency stays under 10 ms
📌 Examples
Meta ranking model: CTR lift 0.10 percentage points, 95 percent CI [0.06, 0.14], ship because lower bound is meaningful and positive
Netflix video recommendation: engagement lift 2.5 percent, 95 percent CI [1.8, 3.2], but the p95 latency increase has 95 percent CI [8, 15] ms, whose upper bound exceeds the 10 ms threshold, so the change requires optimization before shipping
Uber ETA model: mean absolute error reduction 45 seconds, 95 percent CI [30, 60]; the guardrail on surge pricing revenue has 95 percent CI [-0.5, +1.2] percent, which is treated as acceptable
Google Search ranking: CTR lift 0.15 percent, 95 percent CI [0.12, 0.18]; with billions of queries, the narrow interval gives high confidence despite the small absolute effect
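The last example depends on the margin of error shrinking as sample size grows. A rough sketch of that relationship for a single proportion follows, using the normal approximation; the baseline rate of 5 percent and the sample sizes are arbitrary illustrative values, and `margin_of_error` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Half-width of a normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


# The margin scales with 1 / sqrt(n): 100x more data gives a 10x narrower interval.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n={n:>11,}  margin={margin_of_error(0.05, n) * 100:.4f} pp")
```

This is why very large experiments can detect tiny effects, and also why the interval, rather than significance alone, should drive the decision about whether an effect is worth shipping.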