LLM Router Demo

Real-time multi-model routing: see how routing each query by complexity reduces cost and latency

FinOps Routing Signal

Per-model cost values shown below are illustrative estimates from the configured token-rate table in `/api/llm-router`, used for relative comparison only.
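As a rough illustration of how a token-rate table can produce per-model cost estimates for relative comparison, here is a minimal sketch. The `TokenRate` shape, the `RATE_TABLE` entries, and both function names are assumptions made for this example. The actual table behind `/api/llm-router` is not shown on this page, and the rates below are placeholders, not real prices.

```typescript
// Hypothetical shape of one row in a token-rate table.
interface TokenRate {
  inputPerMTok: number;  // USD per 1M input tokens (placeholder value)
  outputPerMTok: number; // USD per 1M output tokens (placeholder value)
}

// Placeholder models and rates, for illustration only.
const RATE_TABLE: Record<string, TokenRate> = {
  "model-a": { inputPerMTok: 1.0, outputPerMTok: 2.0 },
  "model-b": { inputPerMTok: 4.0, outputPerMTok: 8.0 },
};

// Estimate the cost of one request from its token counts.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rate = RATE_TABLE[model];
  if (rate === undefined) {
    throw new Error(`No rate configured for model: ${model}`);
  }
  return (inputTokens * rate.inputPerMTok + outputTokens * rate.outputPerMTok) / 1_000_000;
}

// Cost of model b relative to model a for the same request.
function relativeCost(a: string, b: string, inputTokens: number, outputTokens: number): number {
  return estimateCostUsd(b, inputTokens, outputTokens) / estimateCostUsd(a, inputTokens, outputTokens);
}
```

Because only the ratio between models matters for a relative comparison, the absolute rates can be rough as long as they are consistent across models.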

Route intent: prefer lower-cost low-latency tiers for simple queries, then escalate to higher-capability models only when prompt complexity requires it.
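The route intent above can be sketched as a mapping from prompt complexity to the cheapest tier that can handle it. The tier names, the complexity labels other than "medium-high", and the rule of moving up a tier when the preferred one is unavailable are all assumptions for this example, not the demo's actual logic.

```typescript
// Hypothetical complexity labels and tier names.
type Complexity = "simple" | "medium" | "medium-high" | "high";
type Tier = "low" | "mid" | "high";

// Tiers ordered from cheapest to most capable.
const TIER_ORDER: readonly Tier[] = ["low", "mid", "high"];

// Lowest tier assumed adequate for a given complexity.
function minimumTier(complexity: Complexity): Tier {
  switch (complexity) {
    case "simple":
      return "low";
    case "medium":
    case "medium-high":
      return "mid";
    case "high":
      return "high";
  }
}

// Pick the cheapest available tier at or above the minimum.
// Returns undefined when no adequate tier is available.
function pickTier(complexity: Complexity, available: ReadonlySet<Tier>): Tier | undefined {
  const start = TIER_ORDER.indexOf(minimumTier(complexity));
  for (const tier of TIER_ORDER.slice(start)) {
    if (available.has(tier)) {
      return tier;
    }
  }
  return undefined;
}
```

Starting the search at the minimum tier means a simple query is never sent to a more expensive model unless the cheaper one is unavailable.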

Mid Tier Model Cards

Mid-tier routes balance coding quality, latency, and single-GPU deployment constraints.

reported benchmarks

Qwen3.6-27B

mid tier · dense

Apache 2.0 · 262K ctx · Thinking Preservation · FP8 · GGUF / Local

Dense 27B model with a reported 77.2% on SWE-bench Verified, positioned as flagship-level coding performance on single-GPU hardware. Native Thinking Preservation retains chain-of-thought across turns. An Unsloth GGUF variant is available for local deployment via llama.cpp or LM Studio, with tool calling supported and no API dependency.

VRAM: ~16.8GB (Q4)

Context: 262,144 tokens

Agentic coding · Multi-turn stability · Repository-level reasoning · Frontend generation
Local Deployment: Unsloth GGUF (Q4_K_M, ~16.8GB VRAM) · HuggingFace

Qwen3-35B-A3B-int4

mid tier · moe

MoE · int4 · 3B active · Single GPU

Mixture-of-experts model tuned for low-latency UI paths where only a small active parameter slice is used per token.

VRAM: ~21GB (Q4)

Context: 32,768 tokens

Low-latency UI · Cost-sensitive routing · Local fallback demos

Qwen3.6-27B Routing Rule

task.type === "agent" || task.complexity === "medium-high"

Dense architecture + Thinking Preservation = stable multi-turn reasoning without MoE routing overhead.

Local fallback: when the external API is unavailable → Qwen3.6-27B GGUF via llama.cpp
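The routing rule and the local fallback can be combined in a small function like the sketch below. The predicate is the rule as shown on this page. The `Task` shape, the route identifiers, the `"default"` route, and the choice to apply the fallback to every task are assumptions for illustration.

```typescript
// Hypothetical task shape; only the two fields used by the rule are modeled.
interface Task {
  type: string;       // e.g. "agent"
  complexity: string; // e.g. "medium-high"
}

// Hypothetical route identifiers.
type Route = "qwen3.6-27b" | "qwen3.6-27b-gguf-local" | "default";

// The routing rule as shown on the page.
function matchesQwen27bRule(task: Task): boolean {
  return task.type === "agent" || task.complexity === "medium-high";
}

// Assumption: the local GGUF fallback applies to every task whenever
// the external API is unavailable, regardless of the routing rule.
function route(task: Task, externalApiAvailable: boolean): Route {
  if (!externalApiAvailable) {
    return "qwen3.6-27b-gguf-local";
  }
  return matchesQwen27bRule(task) ? "qwen3.6-27b" : "default";
}
```

Checking availability first keeps the fallback path independent of the rule, so an API outage never leaves a task without a route.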

| Dense vs MoE | 27B Dense | 35B-A3B MoE |
| --- | --- | --- |
| Active params | 27B | 3B |
| SWE-bench | 77.2% | ~72% |
| Inference speed | Moderate | Fast |
| VRAM (Q4) | ~16.8GB | ~21GB |
| Deployment | llama.cpp / LM Studio (GGUF) | API/Cloud |
| Best for | Agent loops | Low-latency UI |
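To show how the single-GPU and context constraints might feed into model selection, the sketch below filters the two mid-tier cards by a VRAM budget and a prompt length. The VRAM and context figures are the ones reported on the cards above. The `ModelCard` shape and the `fitsOn` helper are hypothetical, and the check ignores KV-cache and runtime overhead, so a real deployment would need extra headroom.

```typescript
// Hypothetical card shape holding the figures reported on this page.
interface ModelCard {
  name: string;
  arch: "dense" | "moe";
  vramGbQ4: number;      // approximate VRAM at Q4, as reported
  contextTokens: number; // maximum context length, as reported
}

const MID_TIER: readonly ModelCard[] = [
  { name: "Qwen3.6-27B", arch: "dense", vramGbQ4: 16.8, contextTokens: 262_144 },
  { name: "Qwen3-35B-A3B-int4", arch: "moe", vramGbQ4: 21, contextTokens: 32_768 },
];

// Return the cards whose reported Q4 footprint fits the GPU and whose
// context window can hold the prompt. Overhead beyond weights is ignored.
function fitsOn(gpuVramGb: number, promptTokens: number): ModelCard[] {
  return MID_TIER.filter(
    (card) => card.vramGbQ4 <= gpuVramGb && card.contextTokens >= promptTokens,
  );
}
```

Under these figures, a 20GB budget leaves only the dense model, and a prompt longer than 32,768 tokens does the same regardless of VRAM.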