Cost and Security Telemetry for Production ML
Cost Observability
Cost and security observability are often afterthoughts, but both become critical at scale. For LLMs, track tokens in, tokens out, context length, generation length, and per-request cost. Alert on cost anomalies when tokens per request rise more than 30 percent over baseline, since this often signals prompt template changes, context bloat, or abuse. Monitor cost per 1000 requests daily and compare it to the weekly moving average to catch gradual cost drift before monthly bills spike. A sudden jump from 200 to 300 tokens per request is a 50 percent increase in token volume, and token-based spend rises roughly in proportion if the jump goes undetected.
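The two checks above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names `token_anomaly` and `cost_drift` are hypothetical, the 30 percent threshold comes from the text, and the 7-day window is an assumed reading of "weekly moving average".

```python
from statistics import mean


def token_anomaly(baseline_tokens, current_tokens, threshold=0.30):
    """Return (is_anomalous, relative_increase) for tokens per request.

    The 0.30 threshold mirrors the 30 percent rule in the text.
    """
    increase = (current_tokens - baseline_tokens) / baseline_tokens
    return increase > threshold, increase


def cost_drift(daily_cost_per_1k, window=7):
    """Relative change of the latest daily cost per 1000 requests
    versus the mean of the preceding `window` days (assumed: 7)."""
    if len(daily_cost_per_1k) < window + 1:
        raise ValueError("need at least window + 1 daily values")
    history = daily_cost_per_1k[-(window + 1):-1]
    return daily_cost_per_1k[-1] / mean(history) - 1.0


# The 200 -> 300 tokens example from the text is a 50 percent increase,
# which exceeds the 30 percent alert threshold.
is_anomalous, increase = token_anomaly(200, 300)
```

Comparing against a trailing mean rather than yesterday's value keeps one noisy day from triggering an alert while still surfacing a sustained drift.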
Security Telemetry
Monitor refusal rate, toxicity rate, PII leakage rate, and prompt injection attempts per 1000 requests. The baseline refusal rate might be 1 to 3 percent for a well-tuned system; a spike to 8 to 10 percent usually indicates either a vendor model update that increases false positives or an attack pattern (jailbreak attempts). Toxicity and PII detection require classifying both inputs and outputs, with alerts triggered when rates exceed thresholds (toxicity greater than 0.5 percent, PII leakage greater than 0.1 percent).
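A threshold check over these rates might look like the sketch below. The toxicity and PII limits are the ones stated above; the refusal limit of 8 percent is an assumption taken from the lower end of the spike range, and `security_alerts` and the counter names are hypothetical.

```python
# Alert limits as fractions of total requests.
# Toxicity and PII limits come from the text; the refusal limit is assumed.
THRESHOLDS = {
    "refusal": 0.08,
    "toxicity": 0.005,
    "pii_leakage": 0.001,
}


def security_alerts(counts, total_requests, thresholds=THRESHOLDS):
    """Return {metric: observed_rate} for every metric above its limit.

    `counts` maps a metric name to the number of flagged requests
    in the same window that produced `total_requests`.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    alerts = {}
    for name, limit in thresholds.items():
        rate = counts.get(name, 0) / total_requests
        if rate > limit:
            alerts[name] = rate
    return alerts


# Hypothetical window of 1000 requests: 90 refusals, 3 toxic, 2 PII leaks.
alerts = security_alerts({"refusal": 90, "toxicity": 3, "pii_leakage": 2}, 1000)
```

In this window the refusal and PII leakage rates trip their limits while toxicity (0.3 percent) stays under its 0.5 percent limit.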
Abuse Detection
DDoS or scraping spikes show up as sudden QPS increases (for example, from 1000 to 10000 QPS) with an abnormal distribution of user agents or source IP addresses. These spikes cause cascading latency violations and cost blowouts if not caught early. Monitor QPS per user identifier or IP prefix, alert when any single cohort exceeds 10x its baseline, and enforce per-user rate limits with exponential backoff.
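The cohort rule and the backoff schedule can be sketched as follows. The 10x factor is from the text; the function names, the default baseline for unseen cohorts, and the backoff base and cap are illustrative assumptions.

```python
def cohorts_over_baseline(current_qps, baseline_qps, factor=10.0,
                          default_baseline=1.0):
    """Return the sorted cohorts (user IDs or IP prefixes) whose current
    QPS exceeds `factor` times their baseline.

    Cohorts with no recorded baseline fall back to `default_baseline`
    (an assumed value; tune it to your traffic).
    """
    return sorted(
        cohort
        for cohort, qps in current_qps.items()
        if qps > factor * baseline_qps.get(cohort, default_baseline)
    )


def backoff_seconds(violations, base=1.0, cap=300.0):
    """Exponential backoff: the wait doubles with each repeated
    rate-limit violation and is capped at `cap` seconds."""
    return min(cap, base * 2 ** violations)


# Hypothetical per-prefix traffic: only the first prefix exceeds 10x baseline.
flagged = cohorts_over_baseline(
    {"10.0.0.0/24": 500.0, "10.0.1.0/24": 40.0},
    {"10.0.0.0/24": 20.0, "10.0.1.0/24": 30.0},
)
```

Comparing each cohort to its own baseline, rather than to a global QPS figure, lets a single abusive client stand out even when total traffic looks normal.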
Sampling and Cardinality Failures
Overly aggressive log sampling (1 percent) misses rare but high-impact events such as PII leakage in long-tail prompts or sophisticated prompt injection attempts. High-cardinality labels (per user, per prompt template) explode metrics storage costs, forcing teams to drop dimensions and lose visibility into cohort-specific abuse or cost patterns. The solution is to combine stratified sampling, which oversamples rare events (errors, refusals, long generations), with approximate cardinality techniques (HyperLogLog) that estimate distinct counts for high-cardinality dimensions without storing every label value.
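A stratified sampler along these lines is sketched below. It assumes events are dictionaries with hypothetical `error`, `refused`, and `tokens_out` fields, keeps every rare event, samples the rest at 1 percent, and uses an assumed cutoff of 1000 output tokens for a "long generation". Each kept event carries a weight of 1/rate so aggregate counts can be reconstructed later.

```python
import random

RARE_EVENT_RATE = 1.0   # keep every error, refusal, and long generation
DEFAULT_RATE = 0.01     # 1 percent sampling for ordinary traffic


def sample_rate(event, long_generation_tokens=1000):
    """Pick the sampling rate for an event based on its stratum."""
    is_rare = (
        event.get("error")
        or event.get("refused")
        or event.get("tokens_out", 0) >= long_generation_tokens
    )
    return RARE_EVENT_RATE if is_rare else DEFAULT_RATE


def sample(event, rng=random.random):
    """Return (keep, weight).

    `weight` is 1 / rate, so summing the weights of kept events gives an
    unbiased estimate of the total event count. `rng` is injectable so
    the decision can be made deterministic in tests.
    """
    rate = sample_rate(event)
    return rng() < rate, 1.0 / rate


# A refusal is always kept with weight 1.0; an ordinary request is kept
# about 1 percent of the time and then counts for 100 requests.
keep_refusal, refusal_weight = sample({"refused": True})
```

Storing the weight alongside each sampled log line is what keeps dashboards accurate: without it, oversampled strata would appear far more common than they are.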