GDPR in Distributed Data Pipelines
The Architecture Challenge:
At a company like Meta or Amazon with 100 million monthly active users, personal data flows through thousands of microservices at 1 million events per second. GDPR compliance means controlling this data across the entire pipeline: from user devices to edge APIs, through streaming systems, into data lakes and warehouses, then into machine learning feature stores and dashboards.
Here's how the flow works in practice:
1. Collection with consent checks: Mobile app sends analytics events with user_id, device_id, and context. Privacy layer enforces consent at ingestion. If the user opted out of personalization, events are tagged "analytics only" and excluded from ad-targeting pipelines.
2. Ingestion at scale: Events land in the streaming system at 500,000 events per second with p99 latency of 2 seconds. Schema registry classifies fields as PII, quasi-identifiers, or non-sensitive. PII fields are tokenized or encrypted immediately, with raw identifiers written only to tightly controlled "hot PII" stores (a minimal sketch of this privacy layer follows the list).
3. Processing with purpose limits: Batch and streaming jobs must respect data retention and purpose tags. Advertising models only consume events with allowed purposes. Historical data beyond 13 months is dropped or aggregated to prevent indefinite accumulation.
4. Controlled access: Business intelligence tools query sanitized, pseudonymized views. Row- and column-level policies ensure only authorized roles see direct identifiers. Aggregate queries prevent re-identification by enforcing minimum group sizes.
5. Rights handling: Data subject service tracks identities and rights requests. When a user requests deletion, the service publishes a "deletion order" that flows through the infrastructure. Downstream systems execute deletion within a 7-to-30-day target SLA, including backups.
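To make steps 1 and 2 concrete, here is a minimal Python sketch of an ingestion-time privacy layer. Everything in it is illustrative: the classification table stands in for a schema registry lookup, `tokenize` is a stand-in for a tokenization service backed by a key-management system, and the consent flags would normally come from a consent store rather than a function argument.

```python
import hashlib
import hmac

# Illustrative field classification, standing in for a schema registry lookup.
FIELD_CLASSIFICATION = {
    "user_id": "pii",
    "device_id": "quasi_identifier",
    "page": "non_sensitive",
    "timestamp": "non_sensitive",
}

TOKEN_KEY = b"demo-key"  # placeholder; real systems fetch keys from a KMS


def tokenize(value: str) -> str:
    # Deterministic token: downstream joins still work, raw PII stays in the vault.
    digest = hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()
    return "token_" + digest[:8]


def apply_privacy_layer(event: dict, consent: dict) -> dict:
    """Tag allowed purposes from consent and protect PII before the event is persisted."""
    purposes = ["analytics"]
    if consent.get("personalization", False):
        purposes.append("ad_targeting")  # only events with this tag may feed ad models

    protected = {}
    for name, value in event.items():
        classification = FIELD_CLASSIFICATION.get(name, "pii")  # unknown fields treated as PII
        protected[name] = tokenize(str(value)) if classification == "pii" else value

    protected["allowed_purposes"] = purposes
    return protected


if __name__ == "__main__":
    event = {"user_id": "u123", "device_id": "d456", "page": "/home", "timestamp": 1700000000}
    print(apply_privacy_layer(event, consent={"personalization": False}))  # opted out
```

Deterministic (HMAC-based) tokens keep joins and deduplication working downstream while the raw identifier lives only in the controlled "hot PII" store.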
✓ In Practice: Google's data subject pipelines treat deletion as a graph traversal and workflow orchestration problem across all storage systems. At scale, this isn't just a database delete; it's coordinating deletions across hundreds of datasets, replicas, and derived artifacts.
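The traversal idea can be shown with a toy lineage graph. The graph and dataset names below are invented for illustration; the point is that a deletion order must reach every dataset reachable from the root copy of the user's data, and a workflow orchestrator tracks acknowledgements from each of them.

```python
from collections import deque

# Hypothetical lineage graph: dataset -> datasets derived from it.
LINEAGE = {
    "raw_events": ["sessionized_events", "hot_pii_store"],
    "sessionized_events": ["ads_features", "bi_warehouse"],
    "ads_features": ["ads_training_set"],
    "hot_pii_store": [],
    "bi_warehouse": [],
    "ads_training_set": [],
}


def plan_deletion(root, lineage):
    """Breadth-first walk: every dataset reachable from the root copy of the
    user's data must receive, execute, and acknowledge the deletion order."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        dataset = queue.popleft()
        order.append(dataset)
        for child in lineage.get(dataset, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # A workflow orchestrator would fan out one deletion task per dataset
    # and hold the data subject request open until all of them report done.
    print(plan_deletion("raw_events", LINEAGE))
```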
Scale Reality:
At a 500,000 events per second ingestion rate, even 5 milliseconds of additional latency for tokenization adds up: 5 ms of synchronous work per event at that rate is roughly 2,500 CPU-seconds of work every wall-clock second. Companies must balance synchronous protection (higher latency but immediate safety) against asynchronous batching (lower latency, but windows where raw PII exists in logs). The choice depends on data sensitivity and compliance risk tolerance.
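A rough sketch of the two options, with a simulated 5 ms tokenization call; the service call, timings, and field names are placeholders for illustration, not a recommended implementation.

```python
import time
from queue import Queue
from threading import Thread


def tokenize(value: str) -> str:
    # Stand-in for a call to a tokenization service (~5 ms, per the figure above).
    time.sleep(0.005)
    return f"token_{hash(value) & 0xffffff:06x}"  # non-cryptographic demo token


# Synchronous: PII never leaves ingestion unprotected, but every event pays the cost.
def ingest_sync(event: dict) -> dict:
    event["user_id"] = tokenize(event["user_id"])
    return event  # safe to write anywhere downstream


# Asynchronous: ingestion stays fast, but raw PII sits in the buffer (and in any
# logs or crash dumps of it) until the background worker catches up.
pending: Queue = Queue()


def ingest_async(event: dict) -> dict:
    pending.put(event)  # raw user_id still present here
    return {"status": "queued"}


def tokenizer_worker() -> None:
    while True:
        event = pending.get()
        event["user_id"] = tokenize(event["user_id"])
        # ... write the protected event to the durable topic / table ...
        pending.task_done()


Thread(target=tokenizer_worker, daemon=True).start()

if __name__ == "__main__":
    print(ingest_sync({"user_id": "u123", "page": "/home"}))
    print(ingest_async({"user_id": "u456", "page": "/cart"}))
    pending.join()  # wait until the async path has protected everything
```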
💡 Key Takeaways
✓ At 500,000 events per second ingestion, even 5 ms of tokenization latency becomes a significant cost, forcing synchronous-versus-asynchronous trade-off decisions
✓ Schema registry automatically classifies fields as PII, quasi-identifiers, or non-sensitive to enforce consistent protection across all downstream systems
✓ Purpose tags flow with data through the entire pipeline, ensuring advertising models cannot consume events where the user opted out of personalization
✓ Deletion orders propagate through a pub/sub system to all consumers (databases, warehouses, caches, search indexes) with 7-to-30-day SLA targets
✓ Data retention policies automatically drop or aggregate events beyond 13 months to prevent indefinite accumulation of personal data
📌 Examples
1. Mobile app event tagged "analytics only" is ingested normally but excluded from the recommendation engine pipeline that targets ads
2. Email address tokenized to a stable identifier like "token_8f3a2b" at ingestion, allowing joins and analytics while keeping raw PII in an isolated key vault
3. Deletion request for a user publishes an event to a Kafka topic, consumed by 50+ systems that each mark data deleted and report completion status to an orchestrator (a toy sketch of this fan-out pattern follows)
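A toy sketch of the fan-out-and-acknowledge pattern from example 3. The in-memory loop stands in for a real pub/sub topic, and the consumer names are invented; the piece that matters is tracking which consumers have acknowledged so the orchestrator can close the request within the SLA.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionOrder:
    user_id: str
    completed_by: set = field(default_factory=set)


# Hypothetical consumers; in production each is an independent service
# subscribed to a "deletion-orders" topic (Kafka or similar).
CONSUMERS = ["user_db", "warehouse", "feature_store", "search_index", "cache"]


def handle_deletion(order: DeletionOrder, consumer: str) -> None:
    # Each consumer deletes or anonymizes its own copy of the user's data,
    # then acknowledges so the orchestrator can track progress toward the SLA.
    print(f"{consumer}: deleted data for {order.user_id}")
    order.completed_by.add(consumer)


def orchestrate(user_id: str) -> None:
    order = DeletionOrder(user_id)
    for consumer in CONSUMERS:  # stands in for topic fan-out to 50+ systems
        handle_deletion(order, consumer)
    missing = set(CONSUMERS) - order.completed_by
    print("request closed" if not missing else f"still waiting on: {missing}")


if __name__ == "__main__":
    orchestrate("u123")
```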