Understanding K-Anonymity for Tabular Data Protection
K-anonymity is a formal privacy model that ensures every record in a dataset is indistinguishable from at least k - 1 other records with respect to a chosen set of quasi-identifiers. You select quasi-identifiers such as age, gender, and ZIP code, then verify that every combination of their values appears at least k times in the dataset. Each group of records sharing identical quasi-identifier values forms an equivalence class. If a class has fewer than k members, you must generalize the attributes further (age 34 becomes the band 30-35, ZIP 94102 becomes ZIP3 941) or suppress those records entirely.
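To make the mechanics concrete, here is a minimal pandas sketch; the toy records, column names, and half-open 5-year age banding are illustrative assumptions, not a prescribed scheme. It generalizes raw attributes, forms equivalence classes, and checks whether every class reaches size k:

```python
import pandas as pd

# Toy records; age, gender, and ZIP code are the chosen quasi-identifiers.
df = pd.DataFrame({
    "age":    [31, 32, 33, 34, 61, 63],
    "gender": ["F", "F", "F", "F", "M", "M"],
    "zip":    ["94102", "94103", "94105", "94110", "10001", "10003"],
})

# Generalization hierarchy: exact age -> half-open 5-year band
# (34 -> "30-35"), and 5-digit ZIP -> ZIP3.
low = df["age"] // 5 * 5
df["age_band"] = low.astype(str) + "-" + (low + 5).astype(str)
df["zip3"] = df["zip"].str[:3]

quasi_ids = ["age_band", "gender", "zip3"]
k = 3

# Every unique combination of quasi-identifier values is an equivalence
# class; k-anonymity requires each class to contain at least k records.
class_size = df.groupby(quasi_ids)["zip3"].transform("size")
print(df.assign(class_size=class_size))
print("k-anonymous:", bool((class_size >= k).all()))
# The ("60-65", "M", "100") class has only 2 members, so this table fails
# k = 3 and needs further generalization or suppression.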
In practice, organizations choose k based on risk tolerance and audience. Internal analytics often use k = 3 or k = 5, while data shared externally or released publicly uses k = 10 or even k = 100. Singapore's Personal Data Protection Commission (PDPC) guidance, for example, recommends k = 5 as a minimum for moderately sensitive data. The choice involves trade-offs: higher k provides stronger protection but requires more aggressive generalization, which reduces data utility. With k = 100, rare demographic segments may be suppressed entirely, creating selection bias that harms model recall for underrepresented groups.
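The utility cost of raising k shows up directly in the suppression rate. A small sketch, using invented equivalence-class sizes with a long tail of rare segments:

```python
import pandas as pd

# Invented class sizes after one round of generalization: a few large
# classes plus a long tail of rare demographic segments.
class_sizes = pd.Series([400, 250, 120, 60, 25, 9, 4, 2, 1, 1])
total = class_sizes.sum()

# Records in classes smaller than k must be suppressed (or generalized
# further); the suppressed share is the utility price of a larger k.
for k in (3, 5, 10, 100):
    suppressed = class_sizes[class_sizes < k].sum()
    print(f"k={k:>3}: {suppressed / total:5.1%} of rows lost")
```

With these assumed sizes, k = 3 suppresses about 0.5% of rows while k = 100 suppresses nearly 12%, all of it from the rarest segments.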
A typical production implementation runs in microbatches. Every 5 to 15 minutes, a distributed job computes equivalence classes with a group-by over the quasi-identifiers, counts class sizes, and applies generalization hierarchies until all classes meet the threshold. The system tracks the proportion of rows satisfying k and the maximum reidentification risk, calculated as 1 / class size. For a class of 10 records, the reidentification risk is 10%: an attacker has a 1 in 10 chance of correctly identifying a specific individual.
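A sketch of those per-batch metrics, assuming records arrive already generalized (the column names and toy class sizes here are hypothetical):

```python
import pandas as pd

# Hypothetical microbatch: records already generalized to quasi-identifier
# level. One equivalence class of 10 and one of 2.
batch = pd.DataFrame({
    "age_band": ["30-35"] * 10 + ["60-65"] * 2,
    "gender":   ["F"] * 10 + ["M"] * 2,
    "zip3":     ["941"] * 10 + ["100"] * 2,
})
k = 5

sizes = batch.groupby(["age_band", "gender", "zip3"])["zip3"].transform("size")

# Metrics the job tracks: share of rows already satisfying k, and the
# worst-case reidentification risk, 1 / (smallest class size).
print("rows satisfying k:", round((sizes >= k).mean(), 3))  # 0.833
print("max reidentification risk:", 1 / sizes.min())        # 0.5

# Rows in under-k classes are generalized further or dropped before release.
released = batch[sizes >= k]
```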
Despite its intuitive appeal, k-anonymity has critical limitations. It protects against identity disclosure but not attribute disclosure. In a homogeneity attack, if all 10 records in an equivalence class share the same sensitive value, such as a medical diagnosis, an attacker learns that attribute with certainty. Background-knowledge attacks occur when side information narrows the possibilities: knowing that a 45-year-old female executive in ZIP3 021 must be one of two records, combined with news reports about her condition, enables linkage. These weaknesses motivated stronger models such as l-diversity and t-closeness.
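The homogeneity weakness is easy to detect mechanically: count the distinct sensitive values in each equivalence class, which is exactly the check that l-diversity formalizes. A sketch with an invented 10-record class:

```python
import pandas as pd

# Invented equivalence class: 10 records with identical quasi-identifiers,
# so k = 10 is satisfied, yet all share one diagnosis. An attacker who can
# place a target in this class learns the diagnosis with certainty.
records = pd.DataFrame({
    "age_band":  ["45-50"] * 10,
    "zip3":      ["021"] * 10,
    "diagnosis": ["hepatitis"] * 10,
})

# The check l-diversity formalizes: each class must contain at least
# l distinct sensitive values.
distinct = records.groupby(["age_band", "zip3"])["diagnosis"].nunique()
l = 2
print("classes vulnerable to homogeneity attack:", int((distinct < l).sum()))
```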
💡 Key Takeaways
•K-anonymity ensures every record shares quasi-identifier values with at least k - 1 others, forming equivalence classes of minimum size k
•Organizations use k = 3 to 5 for internal analytics and k = 10 to 100 for external releases, based on PDPC and industry guidance
•Maximum reidentification risk is 1 / k: for k = 10, an attacker has at most a 10% chance of correctly identifying a specific individual in a class
•A homogeneity attack defeats k-anonymity when all records in an equivalence class share the same sensitive attribute value, leaking that attribute with certainty
•Higher k values provide stronger privacy but require more generalization, which can reduce model AUC by 0.5 to 2 points and suppress rare segments, causing selection bias
•Background-knowledge attacks use side information to narrow equivalence classes: knowing someone is a 45-year-old executive in ZIP3 021, plus public news, can enable reidentification
📌 Examples
A hospital dataset with k = 5 generalizes ages to 5-year bands and ZIP codes to ZIP3. A class with age band 30-35, ZIP3 941, gender male contains 8 records, giving a 12.5% reidentification risk
Chrome telemetry applies k = 10 by generalizing device model to family, OS version to major release, and country to region before releasing usage statistics to product teams
An e-commerce company uses k = 100 for public research datasets, which suppresses 8% of transactions from rare geographic regions, introducing bias against rural users in downstream models