Privacy & Fairness in ML: Data Anonymization (PII Removal, k-anonymity)

Pseudonymization vs Anonymization vs Differential Privacy

These three techniques serve different privacy goals and sit at different points on the utility-versus-protection spectrum. Understanding when to use each is critical for production ML systems that must balance privacy, linkability, and performance.

Pseudonymization replaces direct identifiers with stable tokens that can be reversed or joined using a secret mapping. A token vault backed by Hardware Security Modules (HSMs) generates deterministic tokens per tenant and rotates keys every 60 to 90 days. Horizontally scaled vaults handle 50,000 tokenizations per second with single-digit millisecond latency. The key property is reversibility and consistency: the same email always maps to the same token, enabling joins across datasets and tracking over time for attribution and lifetime value modeling. However, this is not anonymization. The holder retains the ability to reidentify individuals, and if keys leak or the mapping is compromised, protection collapses. Use pseudonymization for internal data processing where linkage is required but raw identifiers must not appear in logs, analytics systems, or model training data.

Anonymization aims to make reidentification infeasible even for the data holder. Techniques include k-anonymity for tabular data, aggressive generalization that loses exact values, and suppression of outliers. Once properly anonymized, data can be released more broadly because reversing the transformation is computationally or practically impossible. The trade-off is loss of linkability: you cannot join anonymized datasets across time or sources without risking reidentification. This is appropriate for external data sharing, public research datasets, and scenarios where regulators require irreversible protection.

Differential privacy operates at the query or aggregate level rather than the record level. It adds calibrated Laplace or Gaussian noise to outputs based on sensitivity and a chosen epsilon parameter. Typical production deployments use epsilon values between 1 and 8 per metric and enforce per-user daily privacy budgets to bound cumulative privacy loss from repeated queries. Apple uses local differential privacy on device for select telemetry stats, adding noise before data leaves the device. Google and Meta apply differential privacy to published aggregates and dashboards, ensuring that adding or removing a single user changes any query result by at most a bounded amount. Differential privacy provides mathematical guarantees but adds noise that reduces utility, and it struggles with high-dimensional or sparse data. Use it for public metrics, dashboards, and APIs where users can run arbitrary queries that might leak information through repeated access.
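As a concrete illustration of the deterministic tokenization described above, here is a minimal sketch in Python, assuming an HMAC-SHA256 scheme keyed per tenant (the TENANT_KEYS dict and pseudonymize function are hypothetical names for this example; a real deployment would fetch and rotate keys through a vault or HSM rather than keep them in code):

```python
import hmac
import hashlib

# Hypothetical per-tenant secret; in production this lives in a token
# vault / HSM and is rotated on a 60-90 day schedule.
TENANT_KEYS = {"tenant_a": b"rotate-me-regularly"}

def pseudonymize(identifier: str, tenant: str) -> str:
    """Deterministic token for an identifier.

    The same (tenant, identifier) pair always yields the same token,
    which preserves joinability across datasets. Anyone holding the
    tenant key can rebuild the mapping, so this is pseudonymization,
    not anonymization.
    """
    key = TENANT_KEYS[tenant]
    digest = hmac.new(key, identifier.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Same email -> same token, enabling attribution and lifetime value joins.
assert pseudonymize("user@example.com", "tenant_a") == \
       pseudonymize("user@example.com", "tenant_a")
```

Because the token is a keyed hash rather than a plain hash, an outsider without the tenant key cannot brute-force common identifiers into tokens, but anyone holding the key can rebuild the mapping, which is exactly why this counts as pseudonymization rather than anonymization.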
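The generalize-then-suppress workflow behind k-anonymity can also be sketched briefly. The record layout, field names, and k value below are illustrative, not taken from any specific dataset:

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: bucket exact age, truncate ZIP to 3 digits."""
    return {
        "age": f"{(record['age'] // 10) * 10}s",
        "zip": record["zip"][:3] + "**",
        "diagnosis": record["diagnosis_category"],
    }

def enforce_k_anonymity(records, k):
    """Suppress any record whose quasi-identifier tuple appears in fewer
    than k records; the surviving rows satisfy k-anonymity."""
    generalized = [generalize(r) for r in records]
    counts = Counter(tuple(g.values()) for g in generalized)
    return [g for g in generalized if counts[tuple(g.values())] >= k]

raw = [
    {"age": 34, "zip": "94107", "diagnosis_category": "J06"},
    {"age": 37, "zip": "94110", "diagnosis_category": "J06"},
    {"age": 71, "zip": "10001", "diagnosis_category": "C91"},  # rare outlier
]
# With k=2, the two "30s / 941** / J06" rows survive; the outlier is suppressed.
print(enforce_k_anonymity(raw, k=2))
```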
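Finally, a minimal sketch of the Laplace mechanism for a counting query, assuming sensitivity 1 (adding or removing one user changes a count by at most 1) and an epsilon in the 1-to-8 range mentioned above; the dp_count helper is an illustrative name, not a library API:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy.

    Laplace noise with scale = sensitivity / epsilon bounds the influence
    of any single user on the published value.
    """
    scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

# Repeated queries about the same users should be charged against a
# per-user daily privacy budget to bound cumulative loss.
noisy_clicks = dp_count(true_count=1423, epsilon=4.0)
print(round(noisy_clicks, 1))
```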
💡 Key Takeaways
Pseudonymization uses reversible tokens for internal linkage, with token vaults handling 50,000 operations per second at single-digit millisecond latency and rotating keys every 60 to 90 days
Anonymization makes reidentification infeasible even for the holder, appropriate for external sharing but breaks linkability across datasets and time
Differential privacy adds calibrated noise to aggregate outputs, with epsilon values between 1 and 8 and per-user daily budgets, to provide mathematical privacy guarantees
Apple uses local differential privacy on device for telemetry, adding noise before transmission, while Google and Meta apply it to published aggregates and dashboards
Pseudonymization fails catastrophically if keys leak or the mapping is compromised, exposing all previously tokenized identifiers
Production ML teams typically layer all three: pseudonymize for internal feature stores, apply k-anonymity to released datasets, and add differential privacy to public APIs
📌 Examples
Spotify pseudonymizes user IDs with HMAC-SHA256 and per-tenant keys for internal event streams, enabling attribution across sessions while keeping raw IDs out of analytics warehouses
A healthcare research dataset applies k = 10 anonymization by generalizing diagnosis codes to higher-level categories and suppressing rare conditions, then releases the data under an open license with greatly reduced reidentification risk
Google Ads enforces aggregation thresholds where counts must exceed 100 and adds Laplace noise with epsilon = 4 to campaign metrics, preventing advertisers from inferring individual user behavior through repeated queries