Production Architecture for Model Explanations at Scale
The Latency Problem
Computing SHAP values is expensive. For N features, exact SHAP requires 2^N model evaluations, one per feature coalition. With 50 features, that is over one quadrillion model calls. Even sampling-based approximations take 100-500 ms per prediction. If the latency budget is 50 ms and the model itself takes 20 ms, synchronous explanations are impossible. LIME is faster (10-50 ms) but still significant. Production systems therefore cannot compute explanations for every request.
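The arithmetic above can be sanity-checked in a few lines. The numbers are the illustrative figures from the text, not benchmarks:

```python
# Exact SHAP with 50 features: one model call per feature coalition.
EXACT_SHAP_EVALS = 2 ** 50
print(f"{EXACT_SHAP_EVALS:,} model calls")  # over one quadrillion

def fits_budget(budget_ms: float, model_ms: float, explainer_ms: float) -> bool:
    """True if prediction plus explanation fit inside the latency budget."""
    return model_ms + explainer_ms <= budget_ms

# 50 ms budget, 20 ms model:
print(fits_budget(50, 20, 100))  # approximate SHAP (100-500 ms): False
print(fits_budget(50, 20, 10))   # LIME at its fastest (10 ms): True, barely
```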
Asynchronous Explanation Architecture
Decouple explanation from prediction. The prediction returns in 50 ms; the explanation request is queued asynchronously. Separate workers compute SHAP values and store them in a database keyed by prediction ID, where users or regulators retrieve the pre-computed values later. A typical policy computes explanations for 10-20% of predictions: a random sample plus all disputed or high-stakes decisions. Store results for 90 days to satisfy regulatory retention.
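A minimal in-process sketch of this pattern, using an in-memory queue and dict as stand-ins for the message broker and database a production system would use (the model, sample rate, and SHAP computation are placeholders):

```python
import queue
import random
import threading
import time
import uuid

explanation_queue: queue.Queue = queue.Queue()  # stand-in for SQS/Kafka/etc.
explanation_store: dict = {}                    # stand-in for a database

SAMPLE_RATE = 0.15  # explain ~15% of routine predictions

def predict(features, high_stakes=False):
    """Fast path: return immediately, enqueue explanation work when sampled."""
    prediction_id = str(uuid.uuid4())
    score = sum(features) / len(features)  # placeholder model
    if high_stakes or random.random() < SAMPLE_RATE:
        explanation_queue.put((prediction_id, features))  # non-blocking
    return prediction_id, score

def explanation_worker():
    """Slow path: compute SHAP-like values and persist them by prediction ID."""
    while True:
        item = explanation_queue.get()
        if item is None:  # shutdown sentinel
            explanation_queue.task_done()
            break
        prediction_id, features = item
        time.sleep(0.01)  # stand-in for a 100-500 ms SHAP computation
        mean = sum(features) / len(features)
        explanation_store[prediction_id] = [f - mean for f in features]
        explanation_queue.task_done()

threading.Thread(target=explanation_worker, daemon=True).start()
pid, score = predict([1.0, 2.0, 3.0], high_stakes=True)  # always explained
explanation_queue.join()  # in production: retrieved later by prediction ID
print(pid in explanation_store)  # True
```

The key property is that `predict` never waits on the worker; callers who need the explanation look it up by prediction ID after the fact.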
Caching and Approximation
Cluster-based caching: group similar inputs into clusters and compute explanations only for the cluster centroids. A new input is assigned to its nearest centroid and reuses that centroid's explanation, optionally with per-feature adjustments. This reduces computation 10-100x. Model distillation: train a simple surrogate model to predict SHAP values directly from features. It generates explanations in microseconds; accuracy degrades (typically 85-95% correlation with true SHAP values) but speed improves by orders of magnitude.
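A sketch of the cluster-based lookup. The centroids and their SHAP values are hypothetical artifacts produced offline (e.g. by k-means plus one full SHAP run per centroid); serving is just a nearest-centroid search:

```python
import math

# Hypothetical pre-computed artifacts (names and values are illustrative).
CENTROIDS = {
    "low_risk":  [0.1, 0.2, 0.1],
    "high_risk": [0.9, 0.8, 0.7],
}
CENTROID_SHAP = {
    "low_risk":  [-0.05, 0.02, -0.01],
    "high_risk": [0.30, 0.25, 0.10],
}

def nearest_centroid(features):
    """Assign an input to its closest cluster by Euclidean distance."""
    return min(CENTROIDS, key=lambda name: math.dist(features, CENTROIDS[name]))

def cached_explanation(features):
    """O(number of clusters) lookup instead of a fresh 100-500 ms SHAP run."""
    return CENTROID_SHAP[nearest_centroid(features)]

print(cached_explanation([0.85, 0.75, 0.6]))  # reuses the high_risk explanation
```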
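Distillation can be sketched with a linear surrogate. Here the "true" SHAP values are synthetic (a random linear map plus noise standing in for an expensive offline SHAP run); a real system would fit the surrogate to actual SHAP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: inputs X and their expensively computed SHAP values Y (synthetic here).
X = rng.normal(size=(1000, 5))
true_weights = rng.normal(size=(5, 5))
Y = X @ true_weights + 0.1 * rng.normal(size=(1000, 5))

# Distill: fit a surrogate that maps features to SHAP values directly.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def fast_explanation(features: np.ndarray) -> np.ndarray:
    """Microsecond-scale approximate SHAP: one matrix-vector product."""
    return features @ W

x = rng.normal(size=5)
corr = np.corrcoef(fast_explanation(x), x @ true_weights)[0, 1]
print(f"correlation with 'true' SHAP: {corr:.2f}")  # high, but not exact
```

A linear surrogate is the simplest choice; a small gradient-boosted or neural model fits the same role when the mapping is nonlinear.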
Scaling Explanation Workers
Explanation computation is CPU/GPU-bound, so size workers based on volume: 10,000 explanations per hour at 500 ms each is about 1.4 worker-hours of compute per wall-clock hour, so roughly two dedicated workers. Monitor queue depth and worker utilization, and auto-scale based on backlog.
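The sizing math as a small function. The 70% utilization target is an assumption added here for headroom, not a figure from the text:

```python
import math

def workers_needed(explanations_per_hour: int, seconds_each: float,
                   target_utilization: float = 0.7) -> int:
    """Dedicated workers for the explanation queue.

    target_utilization < 1.0 leaves headroom for bursts (assumed value).
    """
    worker_hours = explanations_per_hour * seconds_each / 3600
    return math.ceil(worker_hours / target_utilization)

# The figures from the text: 10,000 explanations/hour at 500 ms each.
print(workers_needed(10_000, 0.5))  # ~1.4 worker-hours -> 2 workers
```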