LLM Router Demo

Real-time multi-model routing — see how intelligent routing achieves cost and latency savings

FinOps Routing Signal

Per-model cost values shown below are illustrative estimates from the configured token-rate table in `/api/llm-router`, used for relative comparison only.
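As a rough illustration of how a token-rate table can yield relative cost figures, the sketch below multiplies token counts by per-million-token rates. The model keys, the rate values, and the function names `estimateCostUsd` and `relativeCost` are hypothetical placeholders; the actual table behind `/api/llm-router` is not shown here and may be structured differently.

```typescript
// Hypothetical rate table: USD per one million tokens.
// These numbers are placeholders for illustration, not real prices.
interface TokenRate {
  inputPerMTok: number;
  outputPerMTok: number;
}

const RATE_TABLE: Record<string, TokenRate> = {
  "low-tier-example": { inputPerMTok: 0.1, outputPerMTok: 0.4 },
  "mid-tier-example": { inputPerMTok: 0.5, outputPerMTok: 1.5 },
  "high-tier-example": { inputPerMTok: 3, outputPerMTok: 15 },
};

// Estimated cost of one request: tokens times rate, scaled from per-million.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rate = RATE_TABLE[model];
  if (rate === undefined) {
    throw new Error(`No rate configured for model: ${model}`);
  }
  return (inputTokens * rate.inputPerMTok + outputTokens * rate.outputPerMTok) / 1e6;
}

// Cost of `model` expressed as a multiple of `baseline` for the same request.
function relativeCost(model: string, baseline: string, inputTokens: number, outputTokens: number): number {
  return estimateCostUsd(model, inputTokens, outputTokens) / estimateCostUsd(baseline, inputTokens, outputTokens);
}
```

Because the values are only meant for relative comparison, a ratio such as `relativeCost("high-tier-example", "mid-tier-example", ...)` is likely more useful to display than the absolute dollar figure.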

Route intent: prefer lower-cost low-latency tiers for simple queries, then escalate to higher-capability models only when prompt complexity requires it.
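The route intent above can be sketched as a walk over tiers ordered from cheapest to most capable, stopping at the first tier that can handle the prompt. The tier names, the complexity labels other than `"medium-high"`, and the `maxComplexity` ceilings are assumptions made for illustration; how the demo actually scores prompt complexity is not described here.

```typescript
type Complexity = "simple" | "medium" | "medium-high" | "high";

interface Tier {
  name: string;
  maxComplexity: Complexity; // hypothetical capability ceiling for this tier
}

// Complexity labels in ascending order, used to compare two labels.
const COMPLEXITY_ORDER: Complexity[] = ["simple", "medium", "medium-high", "high"];

// Tiers listed cheapest (and lowest latency) first. Illustrative only.
const TIERS: Tier[] = [
  { name: "low", maxComplexity: "simple" },
  { name: "mid", maxComplexity: "medium-high" },
  { name: "high", maxComplexity: "high" },
];

// Return the cheapest tier whose ceiling covers the requested complexity.
function pickTier(complexity: Complexity): Tier {
  const needed = COMPLEXITY_ORDER.indexOf(complexity);
  for (const tier of TIERS) {
    if (COMPLEXITY_ORDER.indexOf(tier.maxComplexity) >= needed) {
      return tier;
    }
  }
  // Nothing covers it: fall back to the most capable tier.
  return TIERS[TIERS.length - 1];
}
```

Keeping the tiers in a cost-ordered list means escalation is just "keep walking", so adding a tier does not require touching the selection logic.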

Business Value Mode

Mid-Tier Model Cards

Mid-tier routes balance coding quality, latency, and single-GPU deployment constraints.

reported benchmarks

Qwen3.6-27B

mid tier · dense

Apache 2.0 · 262K ctx · Thinking Preservation · FP8 · GGUF / Local

Dense 27B model with a reported 77.2% on SWE-bench Verified, giving flagship-level coding performance on single-GPU hardware. Native Thinking Preservation retains chain-of-thought across turns. An Unsloth GGUF variant is available for local deployment via llama.cpp or LM Studio, with tool calling supported and no API dependency.

VRAM: ~16.8GB (Q4)

Context: 262,144 tokens

Agentic coding · Multi-turn stability · Repository-level reasoning · Frontend generation

Local Deployment: Unsloth GGUF (Q4_K_M ~16.8GB VRAM) — HuggingFace ↗

Qwen3-35B-A3B-int4

mid tier · moe

MoE · int4 · 3B active · Single GPU

Mixture-of-experts model tuned for low-latency UI paths where only a small active parameter slice is used per token.

VRAM: ~21GB (Q4)

Context: 32,768 tokens

Low-latency UI · Cost-sensitive routing · Local fallback demos

Qwen3.6-27B Routing Rule

task.type === "agent" || task.complexity === "medium-high"

Dense architecture + Thinking Preservation = stable multi-turn reasoning without MoE routing overhead.

Local fallback: when the external API is unavailable → Qwen3.6-27B GGUF via llama.cpp
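A minimal sketch of how the routing rule and the local fallback might combine in one function. The `Task` shape, the `externalApiAvailable` flag, the `defaultModel` parameter, and the choice to check availability before the rule are all assumptions; only the rule expression and the fallback target come from the text above.

```typescript
interface Task {
  type: string;       // e.g. "agent", "chat" (values other than "agent" are hypothetical)
  complexity: string; // e.g. "medium-high"
}

const QWEN_DENSE = "Qwen3.6-27B";
const QWEN_DENSE_LOCAL = "Qwen3.6-27B GGUF (llama.cpp)";

// The routing rule as stated: agent tasks or medium-high complexity.
function matchesDenseRule(task: Task): boolean {
  return task.type === "agent" || task.complexity === "medium-high";
}

// Assumed precedence: an unavailable external API forces the local GGUF
// build for every task; otherwise the rule decides between the dense
// model and whatever default the caller supplies.
function route(task: Task, externalApiAvailable: boolean, defaultModel: string): string {
  if (!externalApiAvailable) {
    return QWEN_DENSE_LOCAL;
  }
  return matchesDenseRule(task) ? QWEN_DENSE : defaultModel;
}
```

For example, `route({ type: "chat", complexity: "low" }, true, "Qwen3-35B-A3B-int4")` would send a simple chat request to the MoE model, while the same call with `externalApiAvailable` set to `false` would resolve to the local GGUF build.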

Dense vs MoE	27B Dense	35B-A3B MoE
Active params	27B	3B
SWE-bench	77.2%	~72%
Inference speed	Moderate	Fast
VRAM (Q4)	~16.8GB	~21GB
Deployment	llama.cpp / LM Studio (GGUF)	API/Cloud
Best for	Agent loops	Low-latency UI
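The VRAM row of the comparison can be turned into a simple fit check for a given GPU. The sketch below uses the approximate Q4 figures from the table; the `headroomGb` parameter (extra memory reserved for things like the KV cache) and the function name `modelsThatFit` are hypothetical additions, and real memory use will vary with context length and runtime.

```typescript
interface ModelFootprint {
  name: string;
  vramGbQ4: number;      // approximate Q4 VRAM from the comparison table
  activeParamsB: number; // active parameters per token, in billions
}

const MODELS: ModelFootprint[] = [
  { name: "Qwen3.6-27B", vramGbQ4: 16.8, activeParamsB: 27 },
  { name: "Qwen3-35B-A3B-int4", vramGbQ4: 21, activeParamsB: 3 },
];

// Names of models whose Q4 footprint plus reserved headroom fits in VRAM.
function modelsThatFit(availableVramGb: number, headroomGb: number = 0): string[] {
  const names: string[] = [];
  for (const m of MODELS) {
    if (m.vramGbQ4 + headroomGb <= availableVramGb) {
      names.push(m.name);
    }
  }
  return names;
}
```

Under these figures a 24GB card would fit either model with no headroom reserved, while an 18GB budget would leave only the dense 27B build.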