Training ML Models with Differential Privacy (DP-SGD)
Differentially Private Stochastic Gradient Descent (DP-SGD) is the standard technique for training machine learning models with formal privacy guarantees. Instead of computing a single averaged gradient for the batch and updating the weights, DP-SGD computes per-example gradients, clips each one to a fixed L2 norm C (typically 0.1 to 10), sums them, and adds Gaussian noise with standard deviation equal to C times a noise multiplier sigma before applying the weight update. Privacy is controlled by the sampling rate (batch size divided by dataset size), the number of training steps, and sigma, and is tracked with a privacy accountant that reports the total epsilon at a chosen delta (often 1e-5 or 1e-6).
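The update rule is straightforward to sketch. Below is a minimal NumPy illustration of a single DP-SGD step, assuming the per-example gradients are already available as a matrix; the function name, shapes, and hyperparameter values are illustrative rather than any library's API.

```python
import numpy as np

def dp_sgd_update(weights, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step: clip each per-example gradient to L2 norm C,
    sum, add Gaussian noise with standard deviation sigma * C, then update."""
    batch_size = per_example_grads.shape[0]

    # Clip each example's gradient so its L2 norm is at most clip_norm (C).
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)

    # Sum the clipped gradients and add Gaussian noise with std sigma * C.
    noisy_sum = clipped.sum(axis=0) + np.random.normal(
        scale=noise_multiplier * clip_norm, size=weights.shape)

    # Average over the batch and take a gradient step.
    return weights - lr * noisy_sum / batch_size

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
weights = rng.normal(size=10)
per_example_grads = rng.normal(size=(32, 10))  # one gradient row per example
weights = dp_sgd_update(weights, per_example_grads)
```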
The computational cost is significant. Per-example gradient computation requires either backpropagating through each sample individually or using an efficient vectorized implementation, causing a 1.5 to 3 times training slowdown plus extra memory overhead. Utility often drops compared to non-private baselines: expect a few percentage points of accuracy loss on vision or NLP tasks unless you tune aggressively, use larger datasets, or apply strong regularization. Google and Meta have demonstrated DP-SGD at scale, training models on datasets with millions of examples and reporting epsilon values in the single digits for production models.
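The vectorized route can be sketched with PyTorch's torch.func transforms (assuming PyTorch 2.x; the tiny linear model, loss, and batch below are placeholders for illustration):

```python
import torch
from torch.func import functional_call, grad, vmap

# Placeholder model and loss; a real workload substitutes its own.
model = torch.nn.Linear(20, 2)
loss_fn = torch.nn.CrossEntropyLoss()
params = {name: p.detach() for name, p in model.named_parameters()}

def sample_loss(params, x, y):
    # Treat a single example as a batch of one so shapes match the model.
    logits = functional_call(model, params, (x.unsqueeze(0),))
    return loss_fn(logits, y.unsqueeze(0))

# vmap over the batch dimension yields one gradient per example
# without a Python loop over samples.
per_sample_grad_fn = vmap(grad(sample_loss), in_dims=(None, 0, 0))

x = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))
per_sample_grads = per_sample_grad_fn(params, x, y)  # dict: name -> (32, ...) tensor
```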
A key alternative is Private Aggregation of Teacher Ensembles (PATE), which trains multiple teacher models on disjoint data partitions and uses them to label a public or otherwise unlabeled student dataset. When aggregating teacher votes, noise calibrated to the sensitivity of the vote counts is added before taking the argmax. The student model trained on these noisy labels is differentially private with respect to the teacher training data. PATE works well when you have access to unlabeled public data and can tolerate a two-stage training process, and it avoids the per-step overhead of DP-SGD.
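A minimal sketch of PATE's noisy aggregation for one unlabeled student example, assuming Laplace noise added to the per-class vote counts before the argmax; the teacher count, class count, and noise scale below are illustrative choices, not values from any specific deployment.

```python
import numpy as np

def pate_noisy_label(teacher_votes, num_classes, noise_scale=1.0, rng=None):
    """Aggregate one example's teacher votes with Laplace noise.

    teacher_votes: array of predicted class indices, one per teacher.
    Changing a single teacher's training data can change at most one vote,
    so the noise is calibrated to that bounded sensitivity of the counts;
    the scale here is an illustrative privacy/accuracy trade-off.
    """
    if rng is None:
        rng = np.random.default_rng()
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    noisy_counts = counts + rng.laplace(scale=noise_scale, size=num_classes)
    return int(np.argmax(noisy_counts))

# Toy usage: 10 teachers voting among 10 classes for one student example.
rng = np.random.default_rng(0)
votes = rng.integers(0, 10, size=10)
student_label = pate_noisy_label(votes, num_classes=10, rng=rng)
```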
💡 Key Takeaways
•DP-SGD clips per-example gradients to L2 norm C (0.1 to 10 is typical) and adds Gaussian noise with standard deviation sigma times C. Privacy is tracked via a Renyi DP accountant, reporting total epsilon at a small delta such as 1e-5.
•Training slowdown is 1.5 to 3 times due to per-example gradient computation and extra memory use. Utility often drops by a few percentage points unless you use larger datasets, stronger regularization, or extensive hyperparameter tuning.
•Google and Meta have deployed DP-SGD at scale, training models on millions of examples with reported epsilon in single digits. Typical production settings: epsilon 1 to 10, delta 1e-6, clipping norm 1.0, noise multiplier 0.5 to 2.0.
•PATE (Private Aggregation of Teacher Ensembles) avoids per-step overhead by training teacher models on disjoint shards and adding noise to their aggregated votes when labeling student data. It works well when unlabeled public data is available.
•Advanced accounting with Renyi DP gives tighter epsilon bounds than basic composition, especially for subsampled mechanisms. Privacy accountant libraries (TensorFlow Privacy, Opacus) automate this tracking across training steps, as sketched below.
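A sketch of that accounting loop using Opacus's RDP accountant (API details can differ across Opacus versions; the dataset size, batch size, noise multiplier, and step count are illustrative):

```python
from opacus.accountants import RDPAccountant

dataset_size = 60_000
batch_size = 256
noise_multiplier = 1.0
steps = 10_000
delta = 1e-5

accountant = RDPAccountant()
for _ in range(steps):
    # Each DP-SGD step uses Poisson subsampling with rate q = batch / dataset.
    accountant.step(noise_multiplier=noise_multiplier,
                    sample_rate=batch_size / dataset_size)

# Convert the accumulated Renyi DP guarantee into (epsilon, delta).
print(f"epsilon = {accountant.get_epsilon(delta=delta):.2f} at delta = {delta}")
```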
📌 Examples
TensorFlow Privacy library: implements DP-SGD with automatic privacy accounting; used by Google for production models with epsilon 2 to 8 and delta 1e-6 on datasets with millions of training examples.
PyTorch Opacus: per-example gradient engine for DP-SGD that reports epsilon via a Renyi DP accountant. Typical usage: clipping norm 1.0, noise multiplier 1.0, batch size 256, epsilon 3 to 10 after thousands of steps (see the training-loop sketch after this list).
PATE on MNIST: 10 teacher models on disjoint 6,000-example shards, aggregate votes with Laplace noise, label 9,000 public student examples, achieve epsilon 2.04 at delta 1e-5 with 98% accuracy.
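The Opacus usage above can be sketched roughly as follows, assuming the Opacus 1.x PrivacyEngine API; the placeholder model, synthetic data, and hyperparameters are illustrative rather than a recommended configuration.

```python
import torch
from opacus import PrivacyEngine

# Placeholder model, data, and optimizer for illustration.
model = torch.nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
data_loader = torch.utils.data.DataLoader(dataset, batch_size=256)
criterion = torch.nn.CrossEntropyLoss()

# Wrap the training objects so per-example clipping and noise addition
# happen inside optimizer.step(), with accounting handled automatically.
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # sigma
    max_grad_norm=1.0,     # clipping norm C
)

for epoch in range(3):
    for x, y in data_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-6):.2f} at delta = 1e-6")
```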