
FGAC at Production Scale: Real Numbers

The Scale Challenge: Fine-grained access control becomes dramatically harder when you operate a unified data platform with 5 petabytes of data, 50,000 tables, and query volumes of 200,000 interactive queries plus 50,000 scheduled jobs daily. The challenge is not whether FGAC works in theory; it is whether policy evaluation stays fast enough that p99 latency for interactive business intelligence remains under 30 seconds. Consider the arithmetic. If policy evaluation adds 5 milliseconds per query, that is 1,000 seconds, nearly 17 minutes, of pure overhead across 200,000 queries. At 50,000 queries per hour during peak, you need a policy engine capable of handling roughly 14 decisions per second continuously. But p99 matters more than the average: if 1% of policy lookups take 100 milliseconds instead of 1 millisecond because of cache misses or complex attribute resolution, you blow through latency budgets.
Multi-Layer Enforcement: Production systems enforce policies across multiple layers because data has many access paths. The warehouse enforces FGAC in SQL queries, but what about the vector store built from the same data for semantic search? Or the backup exports to object storage? Or the streaming pipeline that materializes aggregates? AWS explicitly addresses this for generative AI use cases: permission evaluation happens in Lake Formation for structured data, fine-grained controls apply in OpenSearch for vector retrieval, encryption at rest uses Key Management Service (KMS), and output filtering applies privacy policies to generated responses. Every layer must enforce consistently, or each gap becomes a bypass path.
Typical Production Scale
5 PB data volume · 250K daily queries · 30 s interactive p99 target
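The throughput arithmetic above can be checked in a few lines. The figures are the article's illustrative numbers, not measurements:

```python
# Back-of-envelope check of the policy-overhead math in the text.
daily_queries = 200_000        # interactive queries per day
per_query_overhead_ms = 5      # hypothetical policy-evaluation cost per query

total_overhead_s = daily_queries * per_query_overhead_ms / 1000
print(f"Total daily overhead: {total_overhead_s:.0f} s "
      f"(~{total_overhead_s / 60:.0f} min)")        # ~1000 s, ~17 min

peak_queries_per_hour = 50_000
decisions_per_second = peak_queries_per_hour / 3600
print(f"Sustained decision rate at peak: {decisions_per_second:.1f}/s")  # ~13.9/s
```

The same arithmetic explains why p99 dominates: a 100 ms tail on even 1% of 200,000 lookups adds 200 seconds of overhead concentrated on exactly the queries users notice.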
Caching and Staleness Trade-offs: Policy caching is essential at scale but introduces risk. With a 5-minute cache Time To Live (TTL), a terminated employee retains access long enough to potentially exfiltrate sensitive data. But disabling caching entirely might add 5 to 10 milliseconds per query at p99 and overwhelms the policy store at tens of thousands of queries per second. The typical compromise is tiered caching: hot-path decisions for active users are cached with a 1-minute TTL, cold-path decisions for service accounts are cached longer, and critical revocations, like employee termination, trigger immediate cache invalidation. This keeps p99 policy overhead under 2 milliseconds while providing acceptable security.
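A minimal sketch of the tiered cache just described, assuming a hypothetical `policy_store` object with an `evaluate(principal, resource)` method; the tier TTLs and the invalidation hook are illustrative, not any vendor's API:

```python
import time

class TieredPolicyCache:
    HOT_TTL = 60        # active users: 1-minute TTL
    COLD_TTL = 300      # service accounts: 5-minute TTL

    def __init__(self, policy_store):
        self._store = policy_store   # authoritative (slow) decision source
        self._cache = {}             # (principal, resource) -> (decision, expiry)

    def decide(self, principal, resource, hot=True):
        key = (principal, resource)
        entry = self._cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]          # cache hit: stays off the policy store
        decision = self._store.evaluate(principal, resource)  # slow path
        ttl = self.HOT_TTL if hot else self.COLD_TTL
        self._cache[key] = (decision, time.time() + ttl)
        return decision

    def revoke_principal(self, principal):
        # Critical revocation (e.g. termination): drop every cached decision
        # for the principal immediately instead of waiting for TTL expiry.
        for key in [k for k in self._cache if k[0] == principal]:
            del self._cache[key]
```

The design point is that `revoke_principal` closes the staleness window for the high-risk case while ordinary decisions still amortize across the TTL.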
❗ Remember: With each 10x increase in scale, pressure falls on the policy engine and permission tables. You may need to shard the policy store, pre-compute permission views for frequent access patterns, and carefully tune cache TTL against revocation latency.
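One way to shard a policy store, sketched under the assumption that decisions are keyed by principal; the shard count and hash routing are hypothetical choices, not a prescribed design:

```python
import hashlib

NUM_SHARDS = 16  # illustrative; sized to the policy store's write/read load

def shard_for(principal: str) -> int:
    """Route all of a principal's policy rows to one shard, so a single
    lookup (or a termination-time invalidation) touches exactly one node."""
    digest = hashlib.sha256(principal.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```

Keying by principal rather than by resource keeps the hot path (one user's decisions) and the critical path (revoking one user) local to a single shard.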
💡 Key Takeaways
At 250,000 daily queries, policy evaluation must complete in under 1ms to avoid dominating latency
Policies must apply across all data access paths: warehouse, vector search, object storage, and ETL
Caching with a 1-to-5-minute TTL balances performance against revocation latency for terminated access
Production systems typically operate with p99 query latency targets under 30 seconds for interactive BI
Policy stores require sharding or pre-computed permission views when scaling beyond tens of thousands of queries per second
📌 Examples
1. A query that normally completes in 5 seconds degrades to 20 seconds at p99 when complex user-specific predicates cannot be pushed down efficiently
2. Vector embeddings built from customer support tickets require FGAC in OpenSearch to prevent semantic search from bypassing warehouse row-level security
3. A fired employee with a 5-minute policy cache retains access long enough to export thousands of customer records before revocation takes effect