Production Architecture for Model Explanations at Scale
In production ML systems, interpretability is a first-class component with distinct online and offline pathways. Consider a credit risk platform handling 5,000 predictions per second with a p95 latency budget of 80 milliseconds. The model scores in 25 milliseconds. The system exposes two explanation modes: synchronous for customer support (returning top 3 to 5 feature attributions within 15 to 20 milliseconds) and batch for compliance audits (computing full attributions offline for all decisions with complete lineage).
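As a concrete illustration of the synchronous path, here is a minimal Python sketch assuming shap and scikit-learn are available; the model, feature names, background sample, and the explain_top_k helper are illustrative stand-ins, not the platform's actual implementation.

```python
import time

import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature manifest and synthetic training data (stand-ins only).
FEATURES = [f"f{i}" for i in range(20)]
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, len(FEATURES)))
y_train = 2.0 * X_train[:, 0] + X_train[:, 1]

model = GradientBoostingRegressor().fit(X_train, y_train)

# Background sample for the explainer; in production its identifier would be
# versioned alongside the model snapshot.
background = X_train[:100]
explainer = shap.TreeExplainer(model, data=background)

def score(x: np.ndarray) -> float:
    """Online scoring path, expected to stay within the ~25 ms model budget."""
    return float(model.predict(x.reshape(1, -1))[0])

def explain_top_k(x: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
    """Synchronous explanation path: return only the top-k feature attributions."""
    phi = explainer.shap_values(x.reshape(1, -1))[0]   # one attribution per feature
    order = np.argsort(np.abs(phi))[::-1][:k]
    return [(FEATURES[i], float(phi[i])) for i in order]

x = rng.normal(size=len(FEATURES))
t0 = time.perf_counter()
top = explain_top_k(x, k=5)
print(f"score={score(x):.3f}, top-5 attributions in "
      f"{(time.perf_counter() - t0) * 1e3:.1f} ms: {top}")
```

Returning only the top few attributions keeps the response payload small and bounds the per-request work of the explainer tier.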
At 5,000 requests per second, computing SHAP attributions for every prediction synchronously would require 20 to 50 dedicated CPU cores if each explanation takes 2 to 5 milliseconds. Most teams avoid this overhead through selective computation: explain only 1 to 5 percent of traffic on demand, cache recent explanations keyed by feature vector hashes for 24 hours, and precompute top K attributions for common scenarios. For batch processing, nightly or hourly jobs compute explanations for millions of decisions. Storing 100 feature contributions as doubles (800 bytes per instance) for 50 million monthly decisions requires only 40 to 80 GB, which is cheap compared to raw feature logs.
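A sketch of the selective-computation and caching pattern, assuming an explain_fn callable such as the top-K helper above; the 2 percent sampling rate, 24 hour TTL, and SHA-256 key over the raw feature vector are illustrative choices, not fixed requirements.

```python
import hashlib
import random
import time

import numpy as np

CACHE_TTL_S = 24 * 3600            # 24-hour TTL for cached explanations
EXPLAIN_SAMPLE_RATE = 0.02         # explain roughly 2% of traffic unless requested
_cache: dict[str, tuple[float, object]] = {}   # feature-hash -> (expiry, attributions)

def feature_hash(x: np.ndarray) -> str:
    """Key recent explanations by a hash of the exact feature vector."""
    return hashlib.sha256(np.ascontiguousarray(x, dtype=np.float64).tobytes()).hexdigest()

def maybe_explain(x: np.ndarray, explain_fn, requested: bool = False):
    """Compute attributions only when requested or sampled, reusing cached results."""
    if not requested and random.random() > EXPLAIN_SAMPLE_RATE:
        return None                                # most traffic: score only, no explanation
    key = feature_hash(x)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                              # cache hit: skip the explainer entirely
    attributions = explain_fn(x)                   # cache miss: compute and store
    _cache[key] = (time.time() + CACHE_TTL_S, attributions)
    return attributions

# Back-of-envelope storage: 100 float64 attributions = 800 bytes per decision,
# so 50 million decisions per month is roughly 40 GB before metadata and indexing.
```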
The explainer service consumes the same feature vectors as the model to avoid divergence. It never re-extracts features. Every explanation is versioned and tied to a specific model snapshot, feature manifest, background sample identifier, and explainer configuration. Retrieval happens through a feature store or dedicated explanation store, joined by decision ID and model version, with retention aligned to regulatory requirements (often 3 to 7 years for financial services).
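One way such a versioned record might look in Python; the field names are assumptions chosen to mirror the lineage described above, not the schema of any particular explanation store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExplanationRecord:
    decision_id: str                 # join key back to the logged decision
    model_version: str               # exact model snapshot that produced the score
    feature_manifest_version: str    # version of the feature schema consumed
    background_sample_id: str        # background dataset the explainer was built on
    explainer_config: dict           # e.g. explainer type, nsamples, top-K setting
    attributions: dict[str, float]   # feature name -> contribution
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def store_key(record: ExplanationRecord) -> tuple[str, str]:
    """Retrieval joins on (decision ID, model version), as described above."""
    return (record.decision_id, record.model_version)
```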
Reliability patterns include circuit breakers that fall back to precomputed top K reasons if the explainer exceeds latency budgets, rate limiting user triggered explanation requests to prevent abuse or model stealing, and separate p95 and p99 latency monitoring for the explainer tier. Meta's internal evaluation services integrate gradient based attributions for deep learning models with strict versioning and access controls, while Google operationalizes explainability through managed pipelines that plug into training and serving infrastructure.
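A minimal circuit-breaker sketch around the explainer call; the latency budget, failure threshold, cooldown, and fallback reason codes are placeholders rather than the platform's actual values, and explain_fn stands in for the (possibly cached) explainer sketched earlier.

```python
import time

LATENCY_BUDGET_S = 0.020           # ~20 ms budget for a synchronous explanation
FAILURE_THRESHOLD = 5              # trip after 5 consecutive budget misses or errors
COOLDOWN_S = 30.0                  # stay open for 30 s before retrying the explainer

# Precomputed top-K reason codes served while the breaker is open (placeholder values).
FALLBACK_REASONS = [("debt_to_income", 0.0), ("credit_utilization", 0.0), ("recent_inquiries", 0.0)]

class ExplainerBreaker:
    def __init__(self, explain_fn):
        self.explain_fn = explain_fn
        self.failures = 0
        self.open_until = 0.0

    def explain(self, x):
        if time.time() < self.open_until:
            return FALLBACK_REASONS                # breaker open: serve precomputed reasons
        start = time.perf_counter()
        try:
            result = self.explain_fn(x)
        except Exception:
            result = None
        elapsed = time.perf_counter() - start
        if result is None or elapsed > LATENCY_BUDGET_S:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.open_until = time.time() + COOLDOWN_S   # open the breaker
                self.failures = 0
            return FALLBACK_REASONS
        self.failures = 0                          # healthy call resets the count
        return result
```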
💡 Key Takeaways
• Synchronous explainers return top 3 to 5 attributions in 15 to 20 milliseconds; explaining every prediction at 5,000 rps would cost 20 to 50 dedicated CPU cores, so most teams explain only 1 to 5 percent of traffic on demand.
• Batch pipelines compute full explanations nightly or hourly for all decisions; 100 double precision attributions take 800 bytes per instance, or roughly 40 to 80 GB per month (with metadata) for 50 million decisions.
• Explanations are versioned and tied to model snapshot, feature manifest, background sample ID, and explainer config to ensure reproducibility and detect drift after updates.
• Caching by feature vector hash reduces redundant computation, with a 24 hour time to live for frequently requested explanations at the edge.
• Circuit breakers fall back to precomputed top K reasons if explainer latency exceeds budget, while rate limiting prevents abuse and model stealing through explanation APIs.
• Separate p95 and p99 latency monitoring for the explainer tier isolates performance issues from model serving, with alerts on attribution drift as a canary for data drift (see the sketch after this list).
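A small sketch of the attribution drift canary mentioned in the last takeaway, operating on windows of stored attribution vectors; the relative-change threshold and the simulated data are illustrative.

```python
import numpy as np

def mean_abs_attribution(batch: np.ndarray) -> np.ndarray:
    """Per-feature mean |attribution| over a window of explained decisions."""
    return np.abs(batch).mean(axis=0)

def attribution_drift(reference: np.ndarray, current: np.ndarray,
                      threshold: float = 0.25) -> list[int]:
    """Indices of features whose mean |attribution| shifted by more than `threshold` (relative)."""
    ref = mean_abs_attribution(reference)
    cur = mean_abs_attribution(current)
    rel_change = np.abs(cur - ref) / np.maximum(ref, 1e-9)
    return [int(i) for i in np.where(rel_change > threshold)[0]]

# Simulated check: scale later features to mimic drift between two windows.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10_000, 100))            # e.g. last week's attribution vectors
current = reference * np.linspace(1.0, 1.6, 100)      # attribution mass shifts toward later features
print(f"{len(attribution_drift(reference, current))} features flagged for attribution drift")
```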
📌 Examples
Meta integrates gradient based attributions into internal evaluation services with strict version control, access logging, and filtered output to prevent exposure of sensitive proxy features.
Google Cloud Vertex AI operationalizes explainability through managed pipelines that version explanation artifacts alongside model checkpoints, with automatic invalidation on feature schema changes.
Microsoft Azure ML explanation store joins with feature store on decision ID and model version, supporting compliance queries that retrieve full attribution vectors for audits spanning 3 to 7 years.
A fintech platform caches SHAP explanations for common credit profiles (income band × product tier) at the edge, achieving a 70 percent cache hit rate and reducing explainer load by two thirds.