
Cost and Security Telemetry for Production ML

Cost and security observability are often afterthoughts, but they become critical at scale. For Large Language Models (LLMs), track tokens in, tokens out, context length, generation length, and per-request cost. Alert on cost anomalies when tokens per request rise more than 30 percent over baseline; this often signals prompt template changes, context bloat, or abuse. Uber and Airbnb monitor cost per 1,000 requests daily, comparing against a weekly moving average to catch gradual cost drift before monthly bills spike. A sudden jump from 200 to 300 tokens per request can double infrastructure spend if undetected.

Security telemetry covers refusal rate, toxicity rate, Personally Identifiable Information (PII) leakage rate, and prompt injection attempts per 1,000 requests. Baseline refusal rate might be 1 to 3 percent for a well-tuned system; a spike to 8 to 10 percent indicates either a vendor model update increasing false positives or an attack pattern (for example, jailbreak attempts). Toxicity and PII detection require classifying both inputs and outputs, with alerts triggered when rates exceed thresholds (for example, toxicity above 0.5 percent, PII leakage above 0.1 percent). Rate limiting and blocklists mitigate prompt injection patterns, but detection must be integrated into the observability stack to measure their effectiveness.

Abuse patterns manifest as Application Programming Interface (API) latency and throughput anomalies. Distributed Denial of Service (DDoS) attacks or scraping show up as sudden Queries Per Second (QPS) increases (for example, from 1,000 to 10,000 QPS) with abnormal distributions of user agents or source Internet Protocol (IP) addresses. These spikes cause cascading latency violations and cost blowouts if not caught early. Monitor QPS per user identifier or IP prefix, alert when any single cohort exceeds 10× baseline, and enforce per-user rate limits with exponential backoff.
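The token-cost anomaly alerting described above can be sketched as a rolling-baseline check. This is a minimal illustration, not a production pipeline: the window size, warm-up count, and 30 percent threshold are the illustrative values from the text, and real systems typically compare cohort aggregates rather than single requests.

```python
from collections import deque

class TokenCostMonitor:
    """Flags requests whose token usage exceeds the rolling baseline.

    Window size, warm-up count, and the 30% threshold are illustrative
    values; tune them to your own traffic.
    """

    def __init__(self, window: int = 10_000, threshold: float = 0.30):
        self.window = deque(maxlen=window)  # rolling tokens-per-request baseline
        self.threshold = threshold

    def record(self, tokens_in: int, tokens_out: int) -> bool:
        """Record one request; return True if it breaches the threshold."""
        total = tokens_in + tokens_out
        alert = False
        if len(self.window) >= 100:  # require a minimal baseline before alerting
            baseline = sum(self.window) / len(self.window)
            alert = total > baseline * (1 + self.threshold)
        self.window.append(total)
        return alert
```

With a baseline of roughly 200 tokens per request, a 300-token request trips the 30 percent threshold (260), matching the doubling-risk example above.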
Netflix uses adaptive rate limiting that tightens during detected abuse and loosens during normal traffic.

Failure modes include blind spots from sampling and cardinality explosions. Overly aggressive log sampling (for example, 1 percent) misses rare but high-impact events like PII leakage in long-tail prompts or sophisticated prompt injection attempts. High-cardinality labels (per user, per prompt template) explode metrics storage costs, forcing teams to drop dimensions and lose visibility into cohort-specific abuse or cost patterns. The solution is stratified sampling that over-samples rare events (errors, refusals, long generations), combined with approximate cardinality techniques (HyperLogLog) to track high-cardinality dimensions without full enumeration.
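Stratified sampling of the kind described above can be sketched as a per-event-type sampling table. The event categories mirror the text; the specific rates are assumptions to be tuned, not recommendations.

```python
import random

# Illustrative sampling rates (assumptions): rare, high-impact events are
# kept at 100%, long generations at 50%, ordinary traffic at 1%.
SAMPLE_RATES = {
    "error": 1.0,
    "refusal": 1.0,
    "long_generation": 0.5,
    "normal": 0.01,
}

def should_log(event_type: str, rng=random) -> bool:
    """Stratified sampling: always keep rare events, heavily sample the rest."""
    rate = SAMPLE_RATES.get(event_type, SAMPLE_RATES["normal"])
    return rng.random() < rate
```

Because rare events are kept at full rate, a 0.1 percent PII-leakage signal survives sampling that would otherwise discard 99 percent of it.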
💡 Key Takeaways
Track tokens in, tokens out, context length, generation length, and per-request cost; alert when tokens per request rise more than 30 percent over baseline to catch prompt bloat or abuse before monthly bills spike
Monitor refusal rate (baseline 1 to 3 percent), toxicity rate (threshold above 0.5 percent), PII leakage (threshold above 0.1 percent), and prompt injection attempts per 1,000 requests, using classification of both inputs and outputs
Abuse patterns (DDoS, scraping) manifest as sudden QPS spikes (for example, from 1,000 to 10,000 QPS) with abnormal user agent or IP distributions, causing cascading latency violations and cost blowouts
Enforce per-user or per-IP-prefix rate limits, alert when any single cohort exceeds 10× baseline QPS, and use adaptive rate limiting that tightens during abuse and loosens during normal traffic
Use stratified sampling to over-sample rare events (errors, refusals, long generations) and approximate cardinality techniques (HyperLogLog) to track high-cardinality dimensions without full enumeration, avoiding metrics storage explosion
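The per-cohort rate limiting in the takeaways above can be sketched as a token bucket whose refill rate tightens under abuse and restores afterward. This is a sketch under assumptions: the class name, burst sizing, and tightening factor are illustrative, and it is not Netflix's actual implementation.

```python
import time

class CohortRateLimiter:
    """Token-bucket limiter for one cohort (user ID or IP prefix).

    Burst capacity is set to 10x baseline QPS, matching the alert
    multiplier in the text; the 0.5 tightening factor is an assumption.
    """

    def __init__(self, baseline_qps: float, alert_multiplier: float = 10.0):
        self.capacity = baseline_qps * alert_multiplier  # burst allowance
        self.tokens = self.capacity
        self.rate = self.capacity  # refill per second; shrinks under abuse
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def tighten(self, factor: float = 0.5) -> None:
        """Adaptive response: shrink the refill rate during detected abuse."""
        self.rate *= factor

    def loosen(self) -> None:
        """Restore the normal refill rate once traffic normalizes."""
        self.rate = self.capacity
```

A supervising detector would call `tighten()` when a cohort crosses 10× baseline and `loosen()` when traffic returns to normal.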
📌 Examples
Uber customer support: tokens per request jumped from 180 to 420 after prompt template change added verbose examples, caught by 30 percent cost anomaly alert, reverted within 2 hours saving $15K per day
Airbnb search assistant: refusal rate spiked from 2 percent to 9 percent after vendor model update, detected within 1 hour via continuous monitoring, prompted rollback and prompt template tuning
Netflix recommendation: DDoS attack increased QPS from 2000 to 25000 per second, per IP rate limiting kicked in at 10× baseline, blocked 98 percent of attack traffic while maintaining service for legitimate users
Meta content moderation: PII leakage in 0.08 percent of outputs detected via output classification, triggered alert at 0.1 percent threshold, investigation found edge case in anonymization logic fixed within 4 hours