Closed-loop LLM quality pipeline — LLM-as-Judge scoring, offline/online eval modes, and CI regression gating that blocks bad deploys automatically
A second LLM scores each response against weighted criteria before it reaches the user.
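The weighted scoring can be sketched as follows. Criterion names and weights here are illustrative assumptions, not the actual configuration: the judge LLM returns per-criterion scores in [0, 1], and a weighted mean collapses them into one quality score.

```typescript
// Hypothetical criterion names and weights -- the real set lives in the
// eval config, not here.
type CriterionScore = { criterion: string; score: number; weight: number };

// Weighted mean of per-criterion judge scores.
function combineScores(scores: CriterionScore[]): number {
  const totalWeight = scores.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  return scores.reduce((sum, c) => sum + c.score * c.weight, 0) / totalWeight;
}

const judged: CriterionScore[] = [
  { criterion: "fidelity", score: 0.93, weight: 0.6 },
  { criterion: "tone", score: 0.88, weight: 0.4 },
];
console.log(combineScores(judged).toFixed(2)); // → 0.91
```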
The judge receives two inputs: the original prompt and the response under evaluation.
Offline evals run on a curated dataset before deploy. Online evals sample live traffic in production.
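The two modes differ only in how cases are selected, a distinction that can be sketched like this (the 5% sample rate is an assumed default, not the actual value):

```typescript
type EvalMode = "offline" | "online";

// Offline: every golden-dataset case is judged before deploy.
// Online: only a sampled fraction of live traffic is judged, so the
// extra judge call adds cost to a small slice of production requests.
function shouldEvaluate(mode: EvalMode, sampleRate = 0.05): boolean {
  if (mode === "offline") return true;
  return Math.random() < sampleRate;
}
```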
27-case golden dataset, run on every CI push
| Case | Query | Fidelity | Hallucination | Status |
|---|---|---|---|---|
| eval-001 | Current role at Krutrim | 0.97 | 0.01 | pass |
| eval-002 | Years at HERE Technologies | 0.93 | 0.03 | pass |
| eval-003 | Ola Maps scale achieved | 0.91 | 0.04 | pass |
| eval-004 | AI stack used at Krutrim | 0.88 | 0.06 | pass |
| eval-005 | Education background | 0.79 | 0.12 | fail |
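A golden-dataset entry might be shaped like the sketch below. The field names are illustrative, not the actual schema in `eval-engine.ts`; the per-case bounds default to the global thresholds.

```typescript
// Hypothetical shape of one golden-dataset case.
interface EvalCase {
  id: string;               // e.g. "eval-001"
  query: string;            // the user prompt to replay against the model
  minFidelity: number;      // per-case floor (global default: 0.85)
  maxHallucination: number; // per-case ceiling (global default: 0.10)
}

const case001: EvalCase = {
  id: "eval-001",
  query: "Current role at Krutrim",
  minFidelity: 0.85,
  maxHallucination: 0.1,
};

console.log(case001.id); // → eval-001
```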
Eval scores are compared against a stored baseline on every push. A regression blocks the deploy.
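The baseline diff can be sketched as a per-case comparison; the 0.02 tolerance here is an assumed value for illustration. Any case whose fidelity drops below its baseline score minus the tolerance is flagged, and a non-empty flag list fails CI.

```typescript
type ScoreMap = Record<string, number>; // caseId -> fidelity score

// Returns the ids of cases whose score dropped beyond the tolerance
// relative to the stored baseline. A missing case counts as a drop to 0.
function findRegressions(
  baseline: ScoreMap,
  current: ScoreMap,
  tolerance = 0.02,
): string[] {
  return Object.keys(baseline).filter(
    (id) => (current[id] ?? 0) < baseline[id] - tolerance,
  );
}

const regressions = findRegressions(
  { "eval-001": 0.97, "eval-002": 0.93 },
  { "eval-001": 0.96, "eval-002": 0.85 },
);
console.log(regressions); // → [ 'eval-002' ]
```

A small dip within the tolerance (eval-001) passes; a larger drop (eval-002) blocks the deploy.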
All metrics above threshold — safe to ship
The CI pipeline gates every push through these stages:

- Static checks: `tsc --noEmit` + `eslint`
- Unit tests: 323 tests
- Eval suite: 27 eval cases
- Regression diff: fidelity Δ vs last deploy
- Thresholds: fidelity ≥ 0.85, hallucination ≤ 0.10
- Deploy target: Vercel edge network
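The final threshold gate reduces to a single predicate over the run's results. A minimal sketch, with the thresholds above hard-coded for illustration (the real values live in the eval config):

```typescript
interface EvalResult {
  fidelity: number;      // judge score in [0, 1], higher is better
  hallucination: number; // judge score in [0, 1], lower is better
}

// True only when every case clears both global thresholds;
// CI blocks the deploy otherwise.
function safeToShip(results: EvalResult[]): boolean {
  return results.every(
    (r) => r.fidelity >= 0.85 && r.hallucination <= 0.1,
  );
}

console.log(safeToShip([{ fidelity: 0.97, hallucination: 0.01 }])); // → true
console.log(safeToShip([{ fidelity: 0.79, hallucination: 0.12 }])); // → false
```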
This pattern ships inside `src/lib/eval-engine.ts` and `src/lib/guardrails.ts` and runs as part of the CI suite on every push.