
What is Differential Privacy?

Differential Privacy (DP) is a mathematical framework that guarantees you can learn useful information about a population while revealing almost nothing about any individual. The key insight is that the output of an analysis should look nearly identical whether or not your data is included. This is formalized through two parameters: epsilon (ε) bounds the maximum multiplicative change in output likelihoods between neighboring datasets that differ by one person, and delta (δ) is a small failure probability, typically kept below 1/n for a dataset of n records. The smaller the epsilon, the stronger the privacy protection, but the more noise you must add.

Noise is calibrated to a query's sensitivity: the Laplace mechanism adds noise with scale sensitivity/epsilon. For example, a count query has sensitivity 1 (adding or removing one person changes the count by at most 1), so with epsilon 1 you add Laplace noise with scale 1. On a true count of 10,000, this gives about 0.01% relative error. Real deployments use larger budgets: the US Census used epsilon 19.61 for its 2020 redistricting statistics covering 300+ million people, while LinkedIn allocated epsilon 14.4 across three months for labor market insights.

A critical property is composition: privacy loss accumulates across multiple analyses. If you run 10 queries each with epsilon 1 on the same population, your total privacy budget is epsilon 10, so production systems need rigorous privacy accounting to track spent budget over time. Post-processing, by contrast, is free: you can transform differentially private outputs however you want without degrading privacy, which lets you apply thresholds, round, or derive ratios from noisy counts without consuming additional budget.
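A minimal sketch of that count-query example in Python, assuming NumPy is available; dp_count and the fixed seed are illustrative, not a production mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-DP via the Laplace mechanism.

    A count query has sensitivity 1 (adding or removing one person
    changes the result by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Worked example from the text: true count 10,000, epsilon 1.
# Laplace(scale=1) noise has expected absolute error 1, i.e. roughly
# 0.01% relative error on a count of 10,000.
print(f"noisy count: {dp_count(10_000, epsilon=1.0):.1f}")
```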
💡 Key Takeaways
Epsilon (ε) controls privacy strength: smaller values mean stronger privacy but more noise. Typical production values range from 0.1 per query for interactive systems to 15 to 20 for major public releases like the US Census.
Delta (δ) is the failure probability, kept negligible relative to population size (often 1e-5 or 1e-6). Setting delta too large risks catastrophic leakage for a small fraction of users.
Composition means privacy loss accumulates: 10 queries with epsilon 1 each consume a total budget of epsilon 10. Production systems must track and enforce lifetime budgets across all releases touching the same users (see the sketch after this list).
Sensitivity determines noise calibration: for count queries sensitivity is 1; for sum queries you must clip individual values to a bounded range. Misestimating sensitivity breaks the privacy guarantee.
The post-processing theorem allows free transformations: you can round, threshold, or compute ratios from differentially private outputs without spending additional budget, though variance may increase.
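To make the last three takeaways concrete, here is a small Python sketch, again assuming NumPy; PrivacyAccountant, dp_count, and dp_sum are hypothetical helpers illustrating sequential composition, clipping-based sensitivity for sums, and free post-processing:

```python
import numpy as np

rng = np.random.default_rng(0)

class PrivacyAccountant:
    """Hypothetical sequential-composition accountant: epsilons simply
    add up, and a query is refused once the lifetime budget would be
    exceeded."""

    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

def dp_count(n: int, epsilon: float, acct: PrivacyAccountant) -> float:
    """Count query: sensitivity 1, Laplace noise with scale 1/epsilon."""
    acct.charge(epsilon)
    return n + rng.laplace(scale=1.0 / epsilon)

def dp_sum(values, lo: float, hi: float, epsilon: float,
           acct: PrivacyAccountant) -> float:
    """Sum query: clip each person's value to [lo, hi] so one person
    changes the sum by at most max(|lo|, |hi|), then calibrate noise
    to that sensitivity."""
    acct.charge(epsilon)
    sensitivity = max(abs(lo), abs(hi))
    clipped = np.clip(values, lo, hi)
    return clipped.sum() + rng.laplace(scale=sensitivity / epsilon)

acct = PrivacyAccountant(total_budget=10.0)
salaries = rng.uniform(30_000, 250_000, size=5_000)  # toy data

# Two epsilon-1 queries consume 2.0 of the epsilon-10 lifetime budget.
noisy_total = dp_sum(salaries, 0, 300_000, epsilon=1.0, acct=acct)
noisy_n = dp_count(len(salaries), epsilon=1.0, acct=acct)

# Post-processing is free: the ratio of two noisy outputs costs nothing.
print(f"noisy mean salary: {noisy_total / noisy_n:,.0f}")
print(f"budget spent: {acct.spent} of {acct.total_budget}")
```

Charging the accountant before releasing any output keeps the bookkeeping conservative: a refused query never leaks, and the ratio at the end consumes no extra budget because it only touches already-noised values.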
📌 Examples
US Census 2020 TopDown Algorithm: epsilon 19.61 total budget for redistricting statistics, processing 300+ million people and producing billions of table cells
LinkedIn labor market insights: epsilon 14.4 split as 4.8 per month across three months for salary and hiring trends
Count query example: true count 10,000, epsilon 1, sensitivity 1, add Laplace(1) noise → expected absolute error is 1 (0.01% relative error)