Production Implementation: Multi-Tier PII Detection Pipeline
TIER 1: RULE-BASED DETECTION
Deterministic regex patterns catch structured PII: emails, phone numbers, credit card numbers, SSNs, and IP addresses. This tier is fast (microseconds per record) and has high precision on well-formatted data, but it misses formatting variations and cannot detect free-text PII such as names in sentences.
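A minimal sketch of this tier is shown below. The patterns are deliberately simplified and US-centric, and the names PATTERNS, luhn_valid, and detect_structured_pii are illustrative assumptions rather than part of any library. The Luhn checksum is one common way to cut false positives on card-like digit runs.

```python
import re

# Simplified, illustrative patterns (assumptions); real deployments need
# locale-specific variants and more careful boundary handling.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_structured_pii(text: str):
    """Return (label, matched_text, start, end) for each regex hit."""
    findings = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            # Discard card-like digit runs that fail the checksum.
            if label == "credit_card" and not luhn_valid(m.group()):
                continue
            findings.append((label, m.group(), m.start(), m.end()))
    return findings


sample = (
    "Contact jane@example.com or 555-123-4567, SSN 123-45-6789, "
    "card 4111 1111 1111 1111, host 192.168.1.10"
)
found_labels = sorted(label for label, *_ in detect_structured_pii(sample))
```

Note that a sentence such as "Jane Doe lives nearby" produces no findings here, which is the gap the NER tier is meant to cover.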
TIER 2: NER-BASED DETECTION
Named entity recognition (a fine-tuned BERT model or spaCy) detects PII in unstructured text: names, addresses, organizations. Set confidence thresholds according to risk: high-sensitivity data uses lower thresholds, accepting more false positives in exchange for fewer misses.
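The risk-based thresholding can be sketched independently of the model. The example below assumes the NER model emits a text span, a label, and a confidence score per entity; the Entity class, the sensitivity levels, and the threshold values are hypothetical and would need calibration against labeled data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    score: float  # model confidence in [0, 1]


# Hypothetical thresholds: the more sensitive the data, the lower the bar
# for flagging an entity (more false positives, fewer misses).
THRESHOLDS = {"high": 0.50, "medium": 0.70, "low": 0.90}


def filter_entities(entities, sensitivity):
    """Keep entities whose score meets the threshold for this sensitivity."""
    threshold = THRESHOLDS[sensitivity]
    return [e for e in entities if e.score >= threshold]


# Stand-in for model output; a real pipeline would get this from the NER model.
candidates = [
    Entity("Jane Doe", "PERSON", 0.95),
    Entity("Acme Corp", "ORG", 0.75),
    Entity("Springfield", "ADDRESS", 0.55),
]

flagged_high = filter_entities(candidates, "high")      # all three flagged
flagged_medium = filter_entities(candidates, "medium")  # top two flagged
flagged_low = filter_entities(candidates, "low")        # only the top one
```

Keeping the thresholds in a single table makes it easier to audit and adjust them per dataset without touching the detection code.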
TIER 3: STATISTICAL DETECTION
This tier detects quasi-identifiers by analyzing column uniqueness and correlation. High-cardinality columns are flagged as potential identifiers. Correlation analysis identifies field combinations that enable re-identification even when each field appears safe on its own.
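As a rough illustration, the sketch below computes a per-column uniqueness ratio and then checks column combinations by counting group sizes. The group-size check is a k-anonymity-style stand-in for the correlation analysis described above, not the same technique, and the function names, the 0.9 cutoff, and k=2 are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def uniqueness_ratio(rows, column):
    """Fraction of distinct values in a column (1.0 means every value is unique)."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)


def flag_high_cardinality(rows, threshold=0.9):
    """Columns whose uniqueness ratio meets the (assumed) threshold."""
    return [c for c in rows[0] if uniqueness_ratio(rows, c) >= threshold]


def risky_combinations(rows, columns, size=2, k=2):
    """Column combinations in which some group of rows has fewer than k members."""
    risky = []
    for combo in combinations(columns, size):
        groups = Counter(tuple(row[c] for c in combo) for row in rows)
        smallest = min(groups.values())
        if smallest < k:
            risky.append((combo, smallest))
    return risky


rows = [
    {"user_id": 1, "zip": "02139", "birth_year": 1980},
    {"user_id": 2, "zip": "02139", "birth_year": 1990},
    {"user_id": 3, "zip": "94105", "birth_year": 1980},
    {"user_id": 4, "zip": "94105", "birth_year": 1990},
]

identifiers = flag_high_cardinality(rows)
combos = risky_combinations(rows, ["zip", "birth_year"])
```

In this toy dataset each of zip and birth_year is shared by two rows, so neither looks identifying alone, yet the pair singles out every row.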
PIPELINE ARCHITECTURE
Run tiers sequentially: regex first (fast), NER second (broader coverage), statistical last (batch). Log each detection with its confidence score for human review. Build feedback loops in which reviewer corrections are used to improve accuracy.
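One way to wire the per-record tiers together is sketched below. The tier functions are stubs (the NER stub is a lookup table standing in for a model), and Detection, run_record_tiers, and the review_below cutoff are hypothetical names and values. The statistical tier is left out because it runs as a batch job over the whole dataset rather than per record.

```python
import logging
import re
from dataclasses import dataclass

logger = logging.getLogger("pii_pipeline")

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


@dataclass
class Detection:
    tier: str
    label: str
    value: str
    confidence: float


def regex_tier(text):
    # Tier 1 stand-in: regex hits are assigned confidence 1.0 (an assumption).
    return [Detection("regex", "email", m.group(), 1.0) for m in EMAIL.finditer(text)]


def ner_tier(text):
    # Tier 2 stand-in: a real pipeline would call a NER model here.
    known_names = {"Jane Doe": 0.82}
    return [
        Detection("ner", "person", name, score)
        for name, score in known_names.items()
        if name in text
    ]


def run_record_tiers(text, tiers, review_below=0.9):
    """Run tiers in order; return all detections and those needing human review."""
    detections, review_queue = [], []
    for tier in tiers:
        for d in tier(text):
            # Log tier, label, and confidence, but not the raw PII value.
            logger.info("tier=%s label=%s confidence=%.2f", d.tier, d.label, d.confidence)
            detections.append(d)
            if d.confidence < review_below:
                review_queue.append(d)
    return detections, review_queue


detections, review_queue = run_record_tiers(
    "Jane Doe wrote from jane@example.com", [regex_tier, ner_tier]
)
```

Reviewer decisions on the queued items could then be stored as labeled examples for retraining the NER model or tightening the regex patterns.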