Prasad Kavuri

Executive Cost Discipline

AI FinOps

AI spend scales unpredictably when cost is invisible until the invoice arrives. These are the levers that make AI spend forecastable, attributable, and defensible to a CFO — each backed by a live, interactive dashboard rather than a static claim.

Pricing basis

Cost figures across these dashboards use a representative token-rate table (Input $3/MTok, Output $15/MTok, Cache read $0.30/MTok, Cache write $3.75/MTok, where MTok means one million tokens) for relative comparison, not a live billing feed. The goal is to make the shape of AI cost tradeoffs concrete, not to quote a specific vendor's current price sheet.

Cost per Request

Every model tier has a real per-request cost floor set by input/output token pricing. Routing a request to a more expensive tier than it needs is the single largest avoidable AI spend line.
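As a minimal sketch of how a per-request figure falls out of token pricing, the snippet below applies the representative rate table from the Pricing basis section to one request's token counts. The function name `request_cost` and the example token counts are illustrative assumptions, not part of the dashboards.

```python
# Representative rates from the "Pricing basis" section, in USD per million tokens.
RATES_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cache_read": 0.30,
    "cache_write": 3.75,
}


def request_cost(input_tokens, output_tokens, cache_read_tokens=0, cache_write_tokens=0):
    """Return the USD cost of one request under the representative rate table.

    Hypothetical helper for illustration; token counts are assumed to come
    from the serving layer's usage metadata.
    """
    usage = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
        "cache_write": cache_write_tokens,
    }
    # Rates are quoted per million tokens, so divide the weighted sum by 1e6.
    return sum(usage[kind] * RATES_PER_MTOK[kind] for kind in usage) / 1_000_000


# Example: 2,000 uncached input tokens and 500 output tokens.
cost = request_cost(input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # prints $0.0135
```

Output tokens dominate here even though there are four times fewer of them, which is why tier choice and response length both show up in the per-request number.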

Why it matters

Making this number visible per request — not just per month — is what turns AI spend from a black box into something a CFO can forecast.

Routing Savings

Complexity-aware routing sends simple queries to smaller, cheaper models and reserves large models for prompts that actually need them.

Why it matters

The router demo computes savings percent and projected dollar savings live, against a real baseline-vs-recommended comparison — not a static claim.
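The baseline-vs-recommended comparison can be sketched roughly as follows. This is not the router demo's actual code: the `small` tier rates, the `is_complex` label, and the sample traffic are made-up placeholders, and only the `large` tier reuses the representative rates from the Pricing basis section.

```python
from dataclasses import dataclass

# USD per million tokens. "large" reuses the representative table from
# "Pricing basis"; "small" is a hypothetical placeholder tier, not a real price.
TIER_RATES = {
    "large": {"input": 3.00, "output": 15.00},
    "small": {"input": 1.00, "output": 5.00},
}


@dataclass
class Request:
    input_tokens: int
    output_tokens: int
    is_complex: bool  # assumed to come from some upstream complexity classifier


def tier_cost(req, tier):
    """USD cost of serving one request on the given tier."""
    rates = TIER_RATES[tier]
    return (req.input_tokens * rates["input"] + req.output_tokens * rates["output"]) / 1_000_000


def routing_savings(requests):
    """Compare 'everything on large' against complexity-aware routing."""
    baseline = sum(tier_cost(r, "large") for r in requests)
    recommended = sum(tier_cost(r, "large" if r.is_complex else "small") for r in requests)
    saved = baseline - recommended
    savings_pct = 100.0 * saved / baseline if baseline else 0.0
    return baseline, recommended, savings_pct


# Illustrative traffic sample: three simple requests and one complex one.
sample = [Request(1_000, 200, False)] * 3 + [Request(4_000, 1_000, True)]
baseline, recommended, savings_pct = routing_savings(sample)
print(f"baseline ${baseline:.4f}, routed ${recommended:.4f}, savings {savings_pct:.1f}%")
```

Projected dollar savings then follow by scaling the per-sample difference to expected monthly request volume, which is an input the sketch leaves to the caller.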

Cache Hit Rate & Token Efficiency

Prefix/KV cache hit rate directly determines how many tokens are billed at full input price ($3/MTok) versus cache-read price ($0.30/MTok) — a 10x difference.
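A rough way to see the effect is to blend the two rates by hit rate. The sketch below uses the representative input and cache-read prices from the Pricing basis section and deliberately ignores the cache-write surcharge, so it is a simplification rather than a billing model. The function name is an assumption.

```python
# Representative rates from "Pricing basis", in USD per million input tokens.
INPUT_RATE = 3.00
CACHE_READ_RATE = 0.30


def effective_input_rate(hit_rate):
    """Blended USD/MTok for input tokens at a given cache hit rate (0.0 to 1.0).

    Simplification: cache-write cost is not modeled.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_READ_RATE + (1.0 - hit_rate) * INPUT_RATE


for h in (0.0, 0.5, 0.9):
    print(f"hit rate {h:.0%}: ${effective_input_rate(h):.2f}/MTok")
```

Under these assumptions the blended input rate falls from $3.00/MTok with no cache hits to $1.65/MTok at a 50% hit rate and $0.57/MTok at 90%.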

Why it matters

Cache hit rate is a leading indicator of inference cost, tracked per model and per use case (RAG, multi-turn chat, code generation) in the control plane.

GPU Utilization

Idle GPU memory and compute are spend with no output. Utilization tracking across GPU memory, CPU memory, and disk shows whether capacity matches actual serving load.

Why it matters

Enterprises overprovision inference hardware by default; visibility into utilization is what justifies right-sizing decisions.

Speculative Decode Savings

Speculative decoding trades a small amount of draft-model compute for large reductions in end-to-end latency and, at scale, GPU-hours per request served.
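The tradeoff can be approximated with the standard analysis from the speculative decoding literature (Leviathan et al., 2023), sketched below. It assumes each drafted token is accepted independently with probability `acceptance_rate`, and that one draft-model step costs `draft_cost_ratio` of a target-model step. Both parameters and the example values are illustrative assumptions, and real workloads will deviate from this idealized model.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model verification pass.

    Assumes i.i.d. acceptance with probability `acceptance_rate` for each of
    the `draft_len` drafted tokens: (1 - a^(g+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus one token from the target model.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def expected_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Idealized wall-time speedup over plain autoregressive decoding.

    Each pass costs draft_len draft steps plus one target step, measured in
    units of one target-model step.
    """
    cost_per_pass = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(acceptance_rate, draft_len) / cost_per_pass


# Illustrative values only: 80% acceptance, 4 drafted tokens, draft step at 5% of target cost.
print(f"{expected_tokens_per_pass(0.8, 4):.2f} tokens per pass")
print(f"{expected_speedup(0.8, 4, 0.05):.2f}x idealized speedup")
```

The model makes the shape of the lever visible: gains depend on the acceptance rate staying high and the draft model staying cheap, and they shrink quickly if either assumption fails.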

Why it matters

This is the same runtime-layer lever behind DeepSeek’s June 2026 DSpark release — detailed with sourcing and caveats on the Runtime Engineering page.

Budget Governance & Business ROI

Per-team monthly budgets, spend-to-budget ratios, and alert thresholds turn AI cost from a reactive surprise into a managed line item, attributable to team and workload.
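A minimal sketch of the spend-to-budget check might look like the following. The team names, dollar figures, and the 80% and 100% thresholds are hypothetical placeholders; actual thresholds would be a policy decision.

```python
# Hypothetical alert thresholds expressed as spend-to-budget ratios.
WARN_THRESHOLD = 0.80
BREACH_THRESHOLD = 1.00


def budget_status(spend, budget):
    """Return (spend-to-budget ratio, alert level) for one team's monthly budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    if ratio >= BREACH_THRESHOLD:
        level = "breach"
    elif ratio >= WARN_THRESHOLD:
        level = "warn"
    else:
        level = "ok"
    return ratio, level


# Illustrative month-to-date spend and monthly budget per team, in USD.
teams = {
    "search": (4_200.0, 5_000.0),
    "support-bot": (1_500.0, 3_000.0),
    "codegen": (6_100.0, 6_000.0),
}

for team, (spend, budget) in teams.items():
    ratio, level = budget_status(spend, budget)
    print(f"{team}: {ratio:.0%} of budget ({level})")
```

Keying the check by team, and in practice by workload as well, is what keeps the alert attributable rather than a single org-wide number.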

Why it matters

ROI conversations require attribution: which team, which workload, which model tier produced which outcome at what cost.