Privacy & Fairness in ML › Data Anonymization (PII Removal, k-anonymity) · Medium · ⏱️ ~3 min

Pseudonymization vs Anonymization vs Differential Privacy

Definition
Three related privacy techniques: pseudonymization replaces identifiers with tokens (reversible), anonymization removes identifying information (irreversible), and differential privacy adds calibrated noise with mathematical privacy guarantees.

PSEUDONYMIZATION: REVERSIBLE TRANSFORMATION

Replaces identifiers with consistent tokens while maintaining a secure mapping table. The same person always gets the same token, enabling longitudinal analysis. Under GDPR, pseudonymized data is still personal data requiring full compliance, but it provides a security benefit: if the dataset is breached, identities stay protected as long as the mapping table remains secure.
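A minimal sketch of this pattern using keyed hashing (HMAC) for deterministic tokens plus a separate mapping table for reversal. The key name and storage are assumptions; in production the secret and the mapping table would live in a KMS or vault, not in application memory:

```python
import hmac
import hashlib

# Hypothetical secret; in practice this comes from a key management service.
SECRET_KEY = b"replace-with-managed-secret"

# Secure mapping table (token -> original identifier) kept separately
# from the dataset; deleting it effectively anonymizes the tokens.
mapping = {}

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a consistent token.

    HMAC makes the token deterministic (same person -> same token,
    which preserves longitudinal analysis) while the secret key
    prevents reversal by anyone without access to it.
    """
    token = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]
    mapping[token] = identifier  # retained only if re-identification must stay possible
    return token

t1 = pseudonymize("alice@example.com")
t2 = pseudonymize("alice@example.com")
assert t1 == t2  # consistent token across records
```

Keeping the mapping in a separate trust zone is what distinguishes this from plain hashing: the data team sees only tokens, while a controlled process can still re-identify when legally required.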

ANONYMIZATION: IRREVERSIBLE REMOVAL

Permanently removes re-identification ability using k-anonymity, generalization, or suppression. Under GDPR, properly anonymized data is not personal data and falls outside regulatory scope. The challenge: proving data is truly anonymous is difficult, and auxiliary data can enable re-identification attacks years later.
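A small illustration of k-anonymity and generalization, under assumed quasi-identifiers (`age`, `zip`) and a made-up toy table: a release is k-anonymous when every quasi-identifier combination appears in at least k records, and generalizing exact ages into decade buckets is one way to get there:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in >= k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

def generalize_age(record):
    """Generalization: replace an exact age with a decade bucket (34 -> '30s')."""
    r = dict(record)
    r["age"] = f"{(record['age'] // 10) * 10}s"
    return r

# Toy dataset for illustration only.
records = [
    {"age": 31, "zip": "02139", "diagnosis": "flu"},
    {"age": 34, "zip": "02139", "diagnosis": "cold"},
    {"age": 37, "zip": "02139", "diagnosis": "flu"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # False: each exact age is unique
generalized = [generalize_age(r) for r in records]
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True: all share ('30s', '02139')
```

Note the limitation the section warns about: even a k-anonymous release can leak information (e.g. if everyone in a group shares the same diagnosis), which is why auxiliary-data attacks remain possible.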

💡 Key Insight: GDPR distinguishes pseudonymization (personal data, full compliance required) from anonymization (not personal data, GDPR-exempt). This legal distinction drives architecture decisions.

DIFFERENTIAL PRIVACY: MATHEMATICAL GUARANTEES

Adds calibrated noise to queries or training with mathematical bounds on privacy loss (epsilon). Unlike k-anonymity, protects against adversaries with arbitrary auxiliary information. For ML, DP-SGD clips gradients and adds noise during backpropagation, preventing models from memorizing training examples.

⚠️ Key Trade-off: Pseudonymization preserves full utility but requires compliance; anonymization reduces utility but eliminates compliance; differential privacy provides strongest guarantees but causes 5-20% accuracy loss.
💡 Key Takeaways
- Pseudonymization is reversible via a mapping table; anonymization is irreversible; differential privacy adds mathematical noise
- GDPR treats pseudonymized data as personal data; anonymized data is exempt from regulation
- Differential privacy protects against adversaries with arbitrary auxiliary information
📌 Interview Tips
1. Use pseudonymization for internal analytics where compliance is manageable; use anonymization for sharing
2. Consider differential privacy when releasing aggregate statistics or training models on sensitive data