
Understanding K-Anonymity for Tabular Data Protection

Definition
K-anonymity ensures that every record in a dataset is indistinguishable from at least k-1 other records with respect to its quasi-identifiers (fields such as zip code, age, and gender that, taken together, can identify an individual).

HOW K-ANONYMITY WORKS

The algorithm generalizes quasi-identifier values until each unique combination appears at least k times. For example, with k=5: specific ages become ranges (25-30), zip codes become prefixes (9021*), and exact dates become months. After transformation, an attacker who knows your quasi-identifiers can narrow you down only to a group of at least k people and cannot uniquely identify you.
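As a minimal sketch (using hypothetical records and assumed bucket choices), the transformation maps exact values onto coarser buckets, then counts each resulting equivalence class to verify the achieved k:

```python
from collections import Counter

# Hypothetical records: (age, zip) are the quasi-identifiers.
records = [
    {"age": 27, "zip": "90210"}, {"age": 29, "zip": "90211"},
    {"age": 26, "zip": "90212"}, {"age": 28, "zip": "90213"},
    {"age": 25, "zip": "90214"}, {"age": 41, "zip": "10001"},
    {"age": 44, "zip": "10002"}, {"age": 43, "zip": "10003"},
    {"age": 40, "zip": "10004"}, {"age": 42, "zip": "10005"},
]

def generalize(rec):
    """Map exact values to coarser buckets: 5-year age bands, 4-digit zip prefixes."""
    lo = rec["age"] // 5 * 5
    return (f"{lo}-{lo + 4}", rec["zip"][:4] + "*")

# Count each equivalence class; the smallest class size is the achieved k.
groups = Counter(generalize(r) for r in records)
k = min(groups.values())
print(groups)
print("achieved k =", k)   # every attacker guess matches at least k records
```

Here both equivalence classes end up with five members, so the release is 5-anonymous; a real anonymizer would iterate, coarsening further until the minimum class size reaches the target.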

CHOOSING THE RIGHT K VALUE

Higher k provides stronger privacy but requires more generalization, reducing data utility. Typical choices: k=5 for low-risk internal analytics, k=10 for shared datasets, k=20+ for public releases with sensitive attributes. The minimum k depends on the attacker model—if adversaries have external data sources, higher k is needed.
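Because the right k depends on how much generalization you can afford, it helps to measure the trade-off directly. The sketch below (hypothetical ages, assumed bucket widths) reports the achieved k and the number of distinct buckets, a rough proxy for remaining utility, as generalization coarsens:

```python
from collections import Counter

# Hypothetical ages; wider buckets raise k but shrink the number of
# distinct values the data can still express.
ages = [23, 24, 25, 27, 31, 33, 34, 38, 41, 45, 47, 52, 55, 58, 61, 64]

for width in (2, 5, 10, 20):
    buckets = Counter(a // width * width for a in ages)
    k = min(buckets.values())          # smallest equivalence class
    print(f"bucket width {width:2d}: k = {k}, distinct buckets = {len(buckets)}")
```

Scanning the output for the narrowest bucket width that meets the target k is a simple way to pick the least destructive generalization level.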

💡 Key Insight: K-anonymity protects against identity disclosure but not attribute disclosure. If all k records in a group share the same sensitive value (all have cancer diagnosis), an attacker learns that attribute with certainty.
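This failure mode is easy to audit: count the distinct sensitive values in each equivalence class (the group's l-diversity). A minimal sketch with hypothetical anonymized rows:

```python
from collections import defaultdict

# Hypothetical rows: (generalized quasi-identifiers, sensitive diagnosis).
rows = [
    (("25-29", "9021*"), "cancer"),
    (("25-29", "9021*"), "cancer"),
    (("25-29", "9021*"), "cancer"),   # homogeneous group: attribute disclosed
    (("40-44", "1000*"), "flu"),
    (("40-44", "1000*"), "cancer"),
    (("40-44", "1000*"), "healthy"),
]

# Collect the set of sensitive values seen in each equivalence class.
groups = defaultdict(set)
for qi, sensitive in rows:
    groups[qi].add(sensitive)

for qi, values in groups.items():
    l = len(values)   # distinct sensitive values = the group's l-diversity
    if l == 1:
        print(qi, "leaks its sensitive attribute (l = 1)")
    else:
        print(qi, f"has l = {l} distinct sensitive values")
```

The first group is k-anonymous yet still discloses the diagnosis to anyone who can place a person in it; enforcing l >= 2 (l-diversity) closes that gap.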

GENERALIZATION TECHNIQUES

Value generalization: replace exact values with ranges (age 34 → 30-40). Suppression: remove outlier records requiring excessive generalization. Cell suppression: replace specific values with wildcards (*). The optimal approach minimizes information loss while achieving k-anonymity.
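These techniques combine naturally: generalize first, then suppress records whose equivalence class still falls short of k. A minimal sketch with hypothetical ages and an assumed target K:

```python
from collections import Counter

K = 3   # assumed anonymity target for this example

# Hypothetical ages; 55 is an outlier that no reasonable decade bucket saves.
ages = [21, 23, 27, 29, 34, 36, 38, 55]

# Value generalization: exact ages become decade buckets.
buckets = [a // 10 * 10 for a in ages]
counts = Counter(buckets)

# Suppression: drop records whose bucket still holds fewer than K members.
released = [f"{b}-{b + 9}" for b in buckets if counts[b] >= K]
suppressed = sum(1 for b in buckets if counts[b] < K)
print("released:", released)
print("suppressed records:", suppressed)
```

Suppressing the single outlier lets the rest of the data keep decade-level precision, rather than forcing every record into one giant age band to absorb it.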

⚠️ Key Trade-off: ML models trained on k-anonymized data typically see 5-15% accuracy drops. Loss is higher for tasks requiring fine-grained distinctions that generalization obscures.
💡 Key Takeaways
- K-anonymity ensures every record is indistinguishable from at least k-1 others based on quasi-identifiers
- Generalization replaces specific values with ranges; suppression removes outliers that would require excessive generalization
- K-anonymity protects against identity disclosure but not attribute disclosure when group members share the same sensitive value
📌 Interview Tips
1. Start with k=5 for internal use, k=10 for shared data, and k=20+ for public releases with sensitive attributes
2. Complement k-anonymity with l-diversity to protect against attribute disclosure attacks