
Production System Architecture for Differential Privacy

A production differential privacy system consists of several layers: ingest and identity linking, contribution bounding, pre-aggregation, a privacy transform with accounting, and release with hygiene rules. Events arrive from clients and are attached to stable user identifiers.

The contribution-bounding layer enforces per-user caps, for example at most 1 count per city per day, or clips sum contributions to a bounded range like [0, 1000]. This converts unbounded global sensitivity into a known constant. Pre-aggregation then reduces cardinality by computing counts, sums, and histograms before applying noise, which improves throughput and reduces the number of noisy releases.

The privacy engine is deterministic given a seeded randomness source and assigns epsilon to each metric based on business value or statistical importance. It computes sensitivity, selects a mechanism (Laplace for pure epsilon-privacy, Gaussian for approximate (epsilon, delta)-privacy with tighter composition), and adds calibrated noise. Every release is logged in a privacy ledger that tracks spent epsilon and delta per user population and time window.

The release service applies minimum thresholds (for example, do not publish bins with noisy counts below 1,000), rounds outputs to reduce granularity, and enforces consistency constraints for hierarchical data such as geographic rollups.

Scale numbers: with 10 million daily active users generating 1 billion events per day, pre-aggregation dominates latency; adding noise is O(1) per metric and negligible. Many companies run DP jobs in daily or hourly batches with end-to-end latency of minutes to a few hours. Interactive systems allocate 0.1 to 1 epsilon per query and enforce a total per-user budget of 1 to 10 per quarter. For large static releases, budgets in the 15 to 20 range are used: the US Census spent 19.61, and LinkedIn spent 14.4 over three months. All choices must be reviewed with legal and privacy teams and documented.
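To make the bounding-then-noise flow concrete, here is a minimal Python sketch of the core transform for counts: cap each user's contribution per partition, pre-aggregate, then add Laplace noise calibrated to sensitivity / epsilon. The function names and toy event format are illustrative, not any particular system's API, and a production system would use a cryptographically secure, discrete noise source rather than random.Random.

```python
import math
import random
from collections import Counter

def bound_contributions(events, max_per_user_per_partition=1):
    """Cap each user's contribution per partition so that global
    sensitivity becomes a known constant (here: 1 per partition)."""
    seen = Counter()
    kept = []
    for user_id, partition in events:
        if seen[(user_id, partition)] < max_per_user_per_partition:
            seen[(user_id, partition)] += 1
            kept.append((user_id, partition))
    return kept

def pre_aggregate(events):
    """Collapse raw events into per-partition counts before noising."""
    return Counter(partition for _, partition in events)

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse CDF from uniform (-0.5, 0.5)."""
    u = rng.random() - 0.5
    while u == -0.5:  # exclude the endpoint where log(0) would blow up
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_counts(events, epsilon, rng):
    """One noisy release: bound, aggregate, add Laplace(sensitivity/epsilon)."""
    bounded = bound_contributions(events)
    counts = pre_aggregate(bounded)
    sensitivity = 1.0  # after bounding, one user shifts each count by <= 1
    scale = sensitivity / epsilon
    return {p: c + laplace_noise(scale, rng) for p, c in counts.items()}

rng = random.Random(2024)  # seeded source: the transform is reproducible
events = [("u1", "nyc"), ("u1", "nyc"), ("u2", "nyc"), ("u3", "sf")]
print(dp_counts(events, epsilon=0.5, rng=rng))
```

Note that the duplicate ("u1", "nyc") event is dropped by the bounding step, which is exactly what makes the sensitivity of each count equal to 1.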
💡 Key Takeaways
Contribution bounding is critical: enforce per-user caps before adding noise, such as at most k rows per partition per day or clipping numeric values to [min, max]. Forgetting this step can make sensitivity unbounded and break the privacy guarantee.
A privacy ledger tracks spent epsilon and delta per user population and time window, enforcing composition by summing epsilons across all releases that touch the same users (a minimal ledger sketch follows this list). Use a Rényi DP accountant for tighter bounds on subsampled mechanisms.
Pre-aggregation improves throughput: with 1 billion events per day, grouping into counts and histograms before adding noise reduces cardinality and makes the privacy transform O(1) per metric instead of O(n) per event.
Release hygiene includes minimum thresholds (do not publish bins with a noisy count below 1,000), rounding to reduce granularity, and consistency enforcement for hierarchical rollups using constrained optimization or correlated noise (see the release sketch after the examples).
Batch systems dominate production: daily or hourly jobs with end-to-end latency of minutes to hours. Interactive systems allocate 0.1 to 1 epsilon per query with quarterly budgets of 1 to 10. Static releases use epsilon 15 to 20 (US Census 19.61, LinkedIn 14.4).
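The ledger sketch below assumes basic sequential composition, where epsilons and deltas simply add across releases touching the same population and window. The class and method names are illustrative, and a real accountant would use tighter composition such as Rényi DP.

```python
from collections import defaultdict

class PrivacyLedger:
    """Tracks cumulative (epsilon, delta) spent per (population, window),
    enforcing a total budget under basic sequential composition."""

    def __init__(self, epsilon_budget, delta_budget=0.0):
        self.epsilon_budget = epsilon_budget
        self.delta_budget = delta_budget
        self.spent = defaultdict(lambda: [0.0, 0.0])  # key -> [eps, delta]

    def charge(self, population, window, epsilon, delta=0.0):
        """Record a release; refuse it if the budget would be exceeded."""
        eps_spent, delta_spent = self.spent[(population, window)]
        if (eps_spent + epsilon > self.epsilon_budget
                or delta_spent + delta > self.delta_budget):
            raise RuntimeError(
                "privacy budget exhausted for %s/%s" % (population, window))
        self.spent[(population, window)][0] += epsilon
        self.spent[(population, window)][1] += delta

ledger = PrivacyLedger(epsilon_budget=10.0, delta_budget=1e-6)
ledger.charge("us_users", "2024-Q1", epsilon=0.5)            # a daily metric
ledger.charge("us_users", "2024-Q1", epsilon=0.5, delta=1e-7)  # a Gaussian release
```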
📌 Examples
Google analytics pipeline: daily batch jobs with epsilon 0.5 to 2 per metric, contribution bounding at 1 count per user per partition per day, pre-aggregation on Dataflow, noise added in the privacy transform, and a ledger that tracks the budget
US Census TopDown Algorithm: epsilon 19.61 total, contribution bounding at the household and person level, hierarchical geographic consistency enforced with constrained inference, billions of table cells released
Interactive DP query system: allocate epsilon 0.1 per query, enforce a total budget of 10 per user per quarter, apply a minimum threshold of 1,000 on noisy counts, and round to the nearest 100 (sketched below)
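As a sketch of the release-hygiene rules from the interactive example above: per-query epsilon, a quarterly budget check, a minimum threshold on noisy counts, and rounding. It reuses laplace_noise and PrivacyLedger from the earlier sketches; the function names and constants here are illustrative, not a specific system's API.

```python
def release(noisy_counts, min_threshold=1000, round_to=100):
    """Apply release hygiene: suppress bins whose noisy count falls
    below the threshold, then round to coarsen granularity."""
    published = {}
    for partition, value in noisy_counts.items():
        if value >= min_threshold:
            published[partition] = round_to * round(value / round_to)
    return published

PER_QUERY_EPSILON = 0.1
QUARTERLY_BUDGET = 10.0  # at most 100 such queries per population per quarter

def answer_query(raw_counts, ledger, population, window, rng):
    """Charge the ledger first (raises if exhausted), then noise and release."""
    ledger.charge(population, window, PER_QUERY_EPSILON)
    scale = 1.0 / PER_QUERY_EPSILON  # sensitivity 1 after contribution bounding
    noisy = {p: c + laplace_noise(scale, rng) for p, c in raw_counts.items()}
    return release(noisy)
```

Charging the ledger before computing the answer matters: a query that would exceed the budget must fail without releasing anything, since even a refused-but-computed answer can leak if it is later published.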