Data Governance & Lineage › Data Masking & Anonymization · Medium · ~3 min

How Masking Works: Classification to Enforcement

The Architecture: Data masking at scale isn't a single operation. It's a pipeline: identify what's sensitive, define policies, and enforce them consistently across ingestion, storage, and query paths. At a company ingesting 100,000 events per second, every event must be classified in near real time. The classification layer uses three detection methods: schema-based rules (columns named email or phone_number), regex patterns (matching email or credit card formats), and ML classifiers trained on labeled PII samples to catch free-form text containing sensitive data.
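A minimal sketch of the first two detection layers (schema rules and regex patterns); the ML classifier layer is omitted, and the rule sets and function names here are illustrative, not from any specific system:

```python
import re

# Hypothetical schema rules: column names treated as PII outright.
SCHEMA_RULES = {"email", "phone_number", "ssn"}

# Hypothetical regex layer: value patterns that indicate PII.
REGEX_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_field(name, value):
    """Return the set of PII types detected for one field of an event."""
    labels = set()
    if name.lower() in SCHEMA_RULES:          # layer 1: schema-based rule
        labels.add(name.lower())
    for pii_type, pattern in REGEX_PATTERNS.items():
        if pattern.search(value):             # layer 2: regex pattern
            labels.add(pii_type)
    return labels

event = {"email": "alice@example.com", "note": "call 4111 1111 1111 1111"}
tagged = {k: classify_field(k, v) for k, v in event.items()}
# "email" is caught by both the schema rule and the email regex;
# "note" is caught only by the credit-card regex.
```

In production this check runs per event on the hot path, which is why the 20 to 30 ms p50 budget matters: a third, slower ML layer would only score free-form fields the cheap layers can't label.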
1. Ingest Classification: Events hit the pipeline, and a service detects PII fields using rules and ML models with a p50 latency of 20 to 30 milliseconds.
2. Policy Lookup: The central policy engine determines transformations per role: analysts see a tokenized user_id, support sees the last 4 phone digits.
3. Transformation: Stateless operations like hashing happen inline. Reversible tokenization calls a vault service with p50 under 3 milliseconds and p99 under 15 milliseconds at 10k ops/sec.
4. Storage: Masked data lands in the data lake and warehouse. Raw PII never touches disk in most systems, reducing breach surface area.
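Step 3's stateless, inline transformations can be sketched as pure functions. This is an illustrative sketch: the function names and the hard-coded key are assumptions, and a real deployment would pull the key from a managed secret store:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # assumption: real systems use a managed, rotated key

def hash_mask(value):
    """Stateless, irreversible masking: a keyed hash, cheap enough to run inline."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def last4_mask(phone):
    """Partial masking for the support role: expose only the last 4 digits."""
    digits = [c for c in phone if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

print(last4_mask("415-555-2671"))  # ******2671
```

Both functions need no external state, which is what lets them run inline in the ingest path; only reversible tokenization requires the round trip to the vault service.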
Policy Engine in Detail: The policy engine is the brain of the system. It stores role-based rules, such as "analysts see a tokenized user_id; support sees only the last 4 digits of a phone number," and every transformation decision in the pipeline consults it.
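A toy version of that rule store, assuming roles and transformation names invented for illustration (a real engine would load versioned rules from a central service rather than a module-level dict):

```python
# Hypothetical policy table: role -> {pii_type -> transformation name}.
POLICIES = {
    "analyst":    {"user_id": "tokenize", "phone_number": "drop"},
    "support":    {"user_id": "tokenize", "phone_number": "last4"},
    "ml_service": {"user_id": "aggregate", "phone_number": "drop"},
}

def lookup(role, pii_type):
    """Resolve which transformation applies; fail closed by redacting
    anything not covered by an explicit rule."""
    return POLICIES.get(role, {}).get(pii_type, "redact")

print(lookup("support", "phone_number"))  # last4
```

The fail-closed default matters: an unknown role or a newly detected PII type gets full redaction until someone writes an explicit policy for it.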
💡 Key Takeaways
Classification happens at ingest using schema rules, regex patterns, and ML models to detect PII in real time at 100k events per second with 20 to 30 ms p50 latency
Central policy engine stores role-based rules determining which transformations apply: analysts see tokens, support sees partial data, ML services get aggregated features
Tokenization vault enables reversible masking with p50 latency under 3 ms and p99 under 15 ms at 10k operations per second using sharded storage and caching
Format-preserving transformations maintain structural validity (phone numbers stay 10 digits, emails keep the @ symbol) so downstream systems and tests don't break
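The format-preserving point can be illustrated with a toy digit-substitution mask. This is a sketch only: production systems use standardized format-preserving encryption such as NIST FF1, not a hash-derived mapping like this:

```python
import hashlib

def fp_mask_phone(phone):
    """Toy format-preserving mask: each digit is replaced by another digit,
    punctuation is kept, so "ddd-ddd-dddd" stays "ddd-ddd-dddd".
    (Sketch only; real systems use standardized FPE such as NIST FF1.)"""
    digest = hashlib.sha256(phone.encode()).digest()
    out, i = [], 0
    for ch in phone:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)

masked = fp_mask_phone("415-555-2671")
# masked has the same ddd-ddd-dddd shape as the input, so downstream
# validators and test fixtures that expect a 10-digit phone still pass.
```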
📌 Examples
1. Netflix generates synthetic datasets daily for dev/QA where feature distributions are preserved but direct identifiers are removed
2. Google and Meta architectures use central authorization services that all data UIs and APIs consult before executing queries, rewriting them to apply role-based masking
3. A tokenization service with consistent hashing ensures the same email always maps to the same token across all systems, enabling joins while protecting raw PII
4. Continuous scanning samples warehouse tables weekly to detect new PII patterns in unclassified columns like free-form notes fields
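Example 3's consistent tokenization can be sketched with a keyed hash: the same email always yields the same token, so masked tables remain joinable. The key handling and `tok_` prefix are assumptions for illustration; in the architecture above, the key would live inside the tokenization vault:

```python
import hashlib
import hmac

TOKEN_KEY = b"vault-managed-key"  # assumption: held by the token vault, never by clients

def tokenize(email):
    """Deterministic token: identical input -> identical token, so two
    systems can join on the token without either seeing the raw email."""
    digest = hmac.new(TOKEN_KEY, email.lower().encode(), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]

# Two pipelines tokenizing the same address (case-insensitively)
# produce the same token, so a join on token works across systems.
assert tokenize("Alice@Example.com") == tokenize("alice@example.com")
```

Keying the hash is what separates this from plain hashing: without the secret, an attacker can't precompute tokens for known emails and reverse the mapping.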