Production Architecture for Model Explanations at Scale
In production ML systems, interpretability is a first-class component with distinct online and offline pathways. Consider a credit risk platform handling 5,000 predictions per second with a p95 latency budget of 80 milliseconds. The model scores in 25 milliseconds. The system exposes two explanation modes: synchronous for customer support (returning top 3 to 5 feature attributions within 15 to 20 milliseconds) and batch for compliance audits (computing full attributions offline for all decisions with complete lineage).
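As a concrete illustration of the synchronous path, here is a minimal Python sketch assuming shap and scikit-learn are available; the model, feature names, background sample, and the explain_top_k helper are illustrative stand-ins, not the platform's actual implementation.

```python
import time

import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature manifest and synthetic training data (stand-ins only).
FEATURES = [f"f{i}" for i in range(20)]
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, len(FEATURES)))
y_train = 2.0 * X_train[:, 0] + X_train[:, 1]

model = GradientBoostingRegressor().fit(X_train, y_train)

# Background sample for the explainer; in production its identifier would be
# versioned alongside the model snapshot.
background = X_train[:100]
explainer = shap.TreeExplainer(model, data=background)

def score(x: np.ndarray) -> float:
    """Online scoring path, expected to stay within the ~25 ms model budget."""
    return float(model.predict(x.reshape(1, -1))[0])

def explain_top_k(x: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
    """Synchronous explanation path: return only the top-k feature attributions."""
    phi = explainer.shap_values(x.reshape(1, -1))[0]   # one attribution per feature
    order = np.argsort(np.abs(phi))[::-1][:k]
    return [(FEATURES[i], float(phi[i])) for i in order]

x = rng.normal(size=len(FEATURES))
t0 = time.perf_counter()
top = explain_top_k(x, k=5)
print(f"score={score(x):.3f}, top-5 attributions in "
      f"{(time.perf_counter() - t0) * 1e3:.1f} ms: {top}")
```

Returning only the top few attributions keeps the response payload small and bounds the per-request work of the explainer tier.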
At 5,000 requests per second, computing SHAP attributions for every prediction synchronously would require 20 to 50 dedicated CPU cores if each explanation takes 2 to 5 milliseconds. Most teams avoid this overhead through selective computation: explain only 1 to 5 percent of traffic on demand, cache recent explanations keyed by feature vector hashes for 24 hours, and precompute top K attributions for common scenarios. For batch processing, nightly or hourly jobs compute explanations for millions of decisions. Storing 100 feature contributions as doubles (800 bytes per instance) for 50 million monthly decisions requires only 40 to 80 GB, which is cheap compared to raw feature logs.
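A sketch of the selective-computation and caching pattern, assuming an explain_fn callable such as the top-K helper above; the 2 percent sampling rate, 24 hour TTL, and SHA-256 key over the raw feature vector are illustrative choices, not fixed requirements.

```python
import hashlib
import random
import time

import numpy as np

CACHE_TTL_S = 24 * 3600            # 24-hour TTL for cached explanations
EXPLAIN_SAMPLE_RATE = 0.02         # explain roughly 2% of traffic unless requested
_cache: dict[str, tuple[float, object]] = {}   # feature-hash -> (expiry, attributions)

def feature_hash(x: np.ndarray) -> str:
    """Key recent explanations by a hash of the exact feature vector."""
    return hashlib.sha256(np.ascontiguousarray(x, dtype=np.float64).tobytes()).hexdigest()

def maybe_explain(x: np.ndarray, explain_fn, requested: bool = False):
    """Compute attributions only when requested or sampled, reusing cached results."""
    if not requested and random.random() > EXPLAIN_SAMPLE_RATE:
        return None                                # most traffic: score only, no explanation
    key = feature_hash(x)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                              # cache hit: skip the explainer entirely
    attributions = explain_fn(x)                   # cache miss: compute and store
    _cache[key] = (time.time() + CACHE_TTL_S, attributions)
    return attributions

# Back-of-envelope storage: 100 float64 attributions = 800 bytes per decision,
# so 50 million decisions per month is roughly 40 GB before metadata and indexing.
```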
The explainer service consumes the same feature vectors as the model to avoid divergence. It never re-extracts features. Every explanation is versioned and tied to a specific model snapshot, feature manifest, background sample identifier, and explainer configuration. Retrieval happens through a feature store or dedicated explanation store, joined by decision ID and model version, with retention aligned to regulatory requirements (often 3 to 7 years for financial services).
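One way such a versioned record might look in Python; the field names are assumptions chosen to mirror the lineage described above, not the schema of any particular explanation store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExplanationRecord:
    decision_id: str                 # join key back to the logged decision
    model_version: str               # exact model snapshot that produced the score
    feature_manifest_version: str    # version of the feature schema consumed
    background_sample_id: str        # background dataset the explainer was built on
    explainer_config: dict           # e.g. explainer type, nsamples, top-K setting
    attributions: dict[str, float]   # feature name -> contribution
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def store_key(record: ExplanationRecord) -> tuple[str, str]:
    """Retrieval joins on (decision ID, model version), as described above."""
    return (record.decision_id, record.model_version)
```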
Reliability patterns include circuit breakers that fall back to precomputed top K reasons if the explainer exceeds latency budgets, rate limiting user triggered explanation requests to prevent abuse or model stealing, and separate p95 and p99 latency monitoring for the explainer tier. Meta's internal evaluation services integrate gradient based attributions for deep learning models with strict versioning and access controls, while Google operationalizes explainability through managed pipelines that plug into training and serving infrastructure.
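A minimal circuit-breaker sketch around the explainer call; the latency budget, failure threshold, cooldown, and fallback reason codes are placeholders rather than the platform's actual values, and explain_fn stands in for the (possibly cached) explainer sketched earlier.

```python
import time

LATENCY_BUDGET_S = 0.020           # ~20 ms budget for a synchronous explanation
FAILURE_THRESHOLD = 5              # trip after 5 consecutive budget misses or errors
COOLDOWN_S = 30.0                  # stay open for 30 s before retrying the explainer

# Precomputed top-K reason codes served while the breaker is open (placeholder values).
FALLBACK_REASONS = [("debt_to_income", 0.0), ("credit_utilization", 0.0), ("recent_inquiries", 0.0)]

class ExplainerBreaker:
    def __init__(self, explain_fn):
        self.explain_fn = explain_fn
        self.failures = 0
        self.open_until = 0.0

    def explain(self, x):
        if time.time() < self.open_until:
            return FALLBACK_REASONS                # breaker open: serve precomputed reasons
        start = time.perf_counter()
        try:
            result = self.explain_fn(x)
        except Exception:
            result = None
        elapsed = time.perf_counter() - start
        if result is None or elapsed > LATENCY_BUDGET_S:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.open_until = time.time() + COOLDOWN_S   # open the breaker
                self.failures = 0
            return FALLBACK_REASONS
        self.failures = 0                          # healthy call resets the count
        return result
```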
💡 Key Takeaways
• Synchronous explainers return top 3 to 5 attributions in 15 to 20 milliseconds; explaining every prediction at 5,000 rps would cost 20 to 50 dedicated CPU cores, so most teams explain only 1 to 5 percent of traffic on demand.
• Batch pipelines compute full explanations nightly or hourly for all decisions; 100 double precision attributions take 800 bytes per instance, or roughly 40 to 80 GB per month (with metadata) for 50 million decisions.
• Explanations are versioned and tied to model snapshot, feature manifest, background sample ID, and explainer config to ensure reproducibility and detect drift after updates.
• Caching by feature vector hash reduces redundant computation, with a 24 hour time to live for frequently requested explanations at the edge.
• Circuit breakers fall back to precomputed top K reasons if explainer latency exceeds budget, while rate limiting prevents abuse and model stealing through explanation APIs.
• Separate p95 and p99 latency monitoring for the explainer tier isolates performance issues from model serving, with alerts on attribution drift as a canary for data drift (see the sketch after this list).
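A small sketch of the attribution drift canary mentioned in the last takeaway, operating on windows of stored attribution vectors; the relative-change threshold and the simulated data are illustrative.

```python
import numpy as np

def mean_abs_attribution(batch: np.ndarray) -> np.ndarray:
    """Per-feature mean |attribution| over a window of explained decisions."""
    return np.abs(batch).mean(axis=0)

def attribution_drift(reference: np.ndarray, current: np.ndarray,
                      threshold: float = 0.25) -> list[int]:
    """Indices of features whose mean |attribution| shifted by more than `threshold` (relative)."""
    ref = mean_abs_attribution(reference)
    cur = mean_abs_attribution(current)
    rel_change = np.abs(cur - ref) / np.maximum(ref, 1e-9)
    return [int(i) for i in np.where(rel_change > threshold)[0]]

# Simulated check: scale later features to mimic drift between two windows.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10_000, 100))            # e.g. last week's attribution vectors
current = reference * np.linspace(1.0, 1.6, 100)      # attribution mass shifts toward later features
print(f"{len(attribution_drift(reference, current))} features flagged for attribution drift")
```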
📌 Examples
Meta integrates gradient based attributions into internal evaluation services with strict version control, access logging, and filtered output to prevent exposure of sensitive proxy features.
Google Cloud Vertex AI operationalizes explainability through managed pipelines that version explanation artifacts alongside model checkpoints, with automatic invalidation on feature schema changes.
Microsoft Azure ML explanation store joins with feature store on decision ID and model version, supporting compliance queries that retrieve full attribution vectors for audits spanning 3 to 7 years.
A fintech platform caches SHAP explanations for common credit profiles (income band × product tier) at the edge, achieving a 70 percent cache hit rate and reducing explainer load by two thirds.