The Evaluation Pipeline Architecture

The Full System:
At companies like OpenAI, Anthropic, or Google, LLM evaluation is not a one time test before launch. It is a continuously running service tightly integrated into the training and deployment pipeline. Understanding this architecture is critical because it shows how evaluation scales from thousands of prompts in research to millions in production.

Here is how the pieces fit together.
Policy RepositoryHarm categories & rules
↓
Prompt Generator1M to 10M test prompts
↓
Evaluation Runner5k to 20k QPS batch
↓
Scoring LayerJudge models + humans
↓
Release GateBlock if metrics regress
Component 1: Policy Repository
This is version controlled documentation defining harm categories like self harm, hate speech, sexual content with minors, malware, fraud, and personal data disclosure. Each category has severity levels, edge case guidelines, and example prompts. Product teams treat policy changes as backward incompatible events requiring full model re evaluation. Think of this as your safety contract.

Component 2: Prompt Generator
Instead of static test sets, you maintain parameterized templates and automated generators. Hand written templates cover classic jailbreak patterns like roleplay attacks ("pretend you are..."), third person requests, or multi step manipulations. Model assisted generators take a harm category and propose candidate prompts designed to bypass guardrails. For a 70B parameter model release, you might generate 1 million to 10 million evaluation prompts covering all risk areas.

Component 3: Evaluation Runner
This internal service selects model variants, runs prompts at scale using batch inference across thousands of Graphics Processing Units (GPUs), and logs all inputs, outputs, and metadata. For cost efficiency, teams often start with smaller proxy models (7B or 13B parameters) to triage ideas before scaling to the full model. Target throughput is typically 5,000 to 20,000 queries per second (QPS) with p95 latency around 300 to 800 milliseconds per prompt.
Evaluation Scale
10M
TEST PROMPTS
20k
QPS THROUGHPUT
500ms
P95 LATENCY
Component 4: Scoring Layer
Outputs flow to safety classifiers, toxicity models, and LLM judge models that answer questions like "Does this violate self harm policy?" Judge models can score thousands of prompts per second, enabling continuous wide coverage. A sample of outputs (typically high severity or ambiguous cases) goes to human raters for ground truth calibration. Scores aggregate into metrics like attack success rate per category and refusal rate on benign prompts.

Component 5: Release Gate
Continuous Integration (CI) and model deployment pipelines query the metrics store and enforce policies. For example, block promotion if critical category regression exceeds threshold, or require manual review for non critical regressions. Dashboards show trend lines across model versions, helping teams understand whether safety is improving or degrading with each training run.

💡 Key Takeaways

✓Policy repository defines version controlled harm categories with severity levels and examples, treating changes as backward incompatible events requiring full re evaluation

✓Prompt generators use templates and model assistance to create 1 million to 10 million evaluation prompts per model release, covering all risk categories systematically

✓Evaluation runner achieves 5,000 to 20,000 QPS throughput with p95 latency of 300 to 800ms by batching across thousands of GPUs, often using smaller proxy models first

✓Release gates block model deployment if critical safety metrics regress, for example if self harm success rate exceeds 0.1 percent threshold

✓Continuous monitoring samples 0.1 to 1 percent of production traffic (thousands of prompts daily) feeding automated scoring and human review queues

📌 Interview Tips

1Policy example: Self harm category with severity 1 (general advice) to severity 5 (detailed instructions), each with 50+ edge case examples for calibration

2Prompt generation: Given 'fraud' category, generator produces variations like direct requests, roleplay scenarios, third person queries, multi step attacks

3Release gate scenario: New model shows 5 percent better general quality but self harm success rate increases from 0.08 to 0.15 percent, deployment blocked until fixed

← Back to LLM Evaluation & Red Teaming Overview