AI Evaluation Showcase

Closed-loop LLM quality pipeline — LLM-as-Judge scoring, offline/online eval modes, and CI regression gating that blocks bad deploys automatically

LLM-as-Judge

A second LLM scores each response against weighted criteria before it reaches the user.

Prompt

What is Prasad's most significant leadership achievement at Krutrim?

Response under evaluation

Prasad led the development of India's first Agentic AI platform at Krutrim, managing a team of 200+ engineers. He architected the end-to-end LLM infrastructure that powers Krutrim's products and scaled the AI org from 40 to 200 engineers within 18 months.
- Factual Accuracy (weight 40%)
- Relevance (weight 25%)
- Completeness (weight 20%)
- Conciseness (weight 15%)
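The weighted aggregation can be sketched as follows. This is a hypothetical reconstruction, not the code in `src/lib/eval-engine.ts`: the `Criterion` type, `weightedScore` function, and the example per-criterion scores are assumptions; only the criterion names and weights come from the list above.

```typescript
// Hypothetical sketch of how a judge model's per-criterion scores
// could be combined into one weighted score. Scores are in [0, 1].
type Criterion = { name: string; weight: number; score: number };

function weightedScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return weighted / totalWeight; // normalize in case weights don't sum to 1
}

// Example scores are illustrative, not from a real judge run.
const judged = weightedScore([
  { name: "factual_accuracy", weight: 0.4, score: 0.95 },
  { name: "relevance", weight: 0.25, score: 0.9 },
  { name: "completeness", weight: 0.2, score: 0.85 },
  { name: "conciseness", weight: 0.15, score: 0.9 },
]); // 0.91
```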

Offline vs Online Evals

Offline evals run on a curated dataset before deploy. Online evals sample live traffic in production.

27-case golden dataset, run on every CI push

| Case | Query | Fidelity | Hallucin. | Status |
|----------|------------------------------|----------|-----------|--------|
| eval-001 | Current role at Krutrim | 0.97 | 0.01 | pass |
| eval-002 | Years at HERE Technologies | 0.93 | 0.03 | pass |
| eval-003 | Ola Maps scale achieved | 0.91 | 0.04 | pass |
| eval-004 | AI stack used at Krutrim | 0.88 | 0.06 | pass |
| eval-005 | Education background | 0.79 | 0.12 | fail |

4 pass, 1 fail, avg fidelity 0.896
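The per-case pass/fail decision can be sketched with the thresholds the regression gate uses (fidelity ≥ 0.85, hallucination ≤ 0.10). This is a minimal illustration, assuming the actual eval engine applies both thresholds per case; the `EvalCase` type and `passes` helper are hypothetical names.

```typescript
// Hypothetical per-case check against the gate thresholds.
type EvalCase = { id: string; fidelity: number; hallucination: number };

const MIN_FIDELITY = 0.85;
const MAX_HALLUCINATION = 0.1;

function passes(c: EvalCase): boolean {
  return c.fidelity >= MIN_FIDELITY && c.hallucination <= MAX_HALLUCINATION;
}
```

Under these thresholds, eval-005 (fidelity 0.79, hallucination 0.12) fails on both counts, matching the table above.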

Regression Gate (CI)

Eval scores are compared against a stored baseline on every push. A regression blocks the deploy.

All metrics above threshold — safe to ship

Build & lint

tsc --noEmit + eslint

Unit tests

323 tests

Run eval suite

27 eval cases

Compare baseline

fidelity Δ vs last deploy

Regression gate

fidelity ≥ 0.85, hallucination ≤ 0.10

Deploy to production

Vercel edge network
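The baseline comparison and gate decision in the CI steps above can be sketched as one predicate. The absolute thresholds come from the pipeline (fidelity ≥ 0.85, hallucination ≤ 0.10); treating "regression" as a relative fidelity drop of more than 5% versus the stored baseline is an assumption, and the `Metrics` type and `gateAllowsDeploy` name are hypothetical.

```typescript
// Hypothetical sketch of the CI regression gate: absolute thresholds
// plus a relative comparison against the stored baseline.
type Metrics = { fidelity: number; hallucination: number };

function gateAllowsDeploy(current: Metrics, baseline: Metrics): boolean {
  const absoluteOk =
    current.fidelity >= 0.85 && current.hallucination <= 0.1;
  // Assumed: a fidelity drop of more than 5% vs baseline blocks the deploy.
  const noRegression = current.fidelity >= baseline.fidelity * 0.95;
  return absoluteOk && noRegression;
}
```

With the current run's average fidelity of 0.896 and any recent baseline near 0.90, both conditions hold and the deploy proceeds.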

How this pipeline fits together

  1. LLM-as-Judge — a separate model scores every sampled response against weighted criteria (accuracy, relevance, completeness, conciseness).
  2. Offline evals — a curated 27-case golden dataset runs on every CI push; one failing case blocks the merge.
  3. Online evals — 1% of live traffic is scored asynchronously; drift triggers an alert before it becomes a customer problem.
  4. Regression gate — fidelity and hallucination scores are compared to a stored baseline; a regression of more than 5% blocks the deploy entirely.
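The online-eval sampling in step 3 can be sketched as a simple probabilistic gate. This is a minimal illustration, assuming per-request random sampling; the `shouldSample` helper and injectable `rng` parameter are hypothetical, and only the 1% rate comes from the text.

```typescript
// Hypothetical sketch of 1% online sampling: each live response has a
// 1-in-100 chance of being sent to the judge asynchronously.
const SAMPLE_RATE = 0.01;

function shouldSample(rng: () => number = Math.random): boolean {
  return rng() < SAMPLE_RATE;
}
```

Injecting the random source makes the sampler deterministic in tests; production callers would use the `Math.random` default.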

This pattern ships inside src/lib/eval-engine.ts and src/lib/guardrails.ts and runs as part of the CI suite on every push.