Real-time multi-model routing — see how intelligent routing achieves cost and latency savings
Per-model cost values shown below are illustrative estimates from the configured token-rate table in `/api/llm-router`, used for relative comparison only.
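As a rough illustration of how such relative estimates could be derived, the sketch below multiplies token counts by per-million-token rates. The table shape, the model keys (`model-a`, `model-b`), and every rate value are placeholders invented for this example. The actual table behind `/api/llm-router` is not shown on this page.

```typescript
// Hypothetical token-rate table. Keys and values are placeholders,
// not the rates configured in /api/llm-router.
interface TokenRate {
  inputPerMTok: number;  // assumed unit: USD per million input tokens
  outputPerMTok: number; // assumed unit: USD per million output tokens
}

const RATE_TABLE: Record<string, TokenRate> = {
  "model-a": { inputPerMTok: 1, outputPerMTok: 2 },
  "model-b": { inputPerMTok: 4, outputPerMTok: 8 },
};

// Estimated cost of one request for the given model.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const rate = RATE_TABLE[model];
  if (rate === undefined) {
    throw new Error(`no rate configured for ${model}`);
  }
  return (inputTokens * rate.inputPerMTok + outputTokens * rate.outputPerMTok) / 1e6;
}

// Cost of `model` relative to `baseline` for the same token counts.
function relativeCost(model: string, baseline: string, inputTokens: number, outputTokens: number): number {
  return estimateCost(model, inputTokens, outputTokens) / estimateCost(baseline, inputTokens, outputTokens);
}
```

Because the figures are meant for relative comparison only, a ratio such as `relativeCost("model-b", "model-a", 1000, 1000)` is more meaningful than either absolute estimate.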
Route intent: prefer lower-cost, low-latency tiers for simple queries, and escalate to higher-capability models only when prompt complexity requires it.
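One way to picture this escalation is a list of tiers ordered from cheapest to most capable, where a request takes the first tier whose ceiling covers its complexity. This is a minimal sketch: the complexity scale, the tier names, and the ceilings are assumptions made for illustration, chosen only so that `"medium-high"` lands on the mid tier as the routing rule on this page suggests.

```typescript
// Assumed complexity scale and tier names. Neither is confirmed by the router config.
type Complexity = "low" | "medium" | "medium-high" | "high";
type Tier = "low" | "mid" | "high";

const COMPLEXITY_ORDER: Complexity[] = ["low", "medium", "medium-high", "high"];

// Tiers ordered cheapest/fastest first. Each ceiling is a hypothetical threshold.
const TIERS: { tier: Tier; maxComplexity: Complexity }[] = [
  { tier: "low", maxComplexity: "low" },
  { tier: "mid", maxComplexity: "medium-high" },
  { tier: "high", maxComplexity: "high" },
];

// Return the cheapest tier whose ceiling covers the request's complexity.
function pickTier(complexity: Complexity): Tier {
  const rank = COMPLEXITY_ORDER.indexOf(complexity);
  for (const candidate of TIERS) {
    if (rank <= COMPLEXITY_ORDER.indexOf(candidate.maxComplexity)) {
      return candidate.tier;
    }
  }
  // Not reachable with the table above. Kept as a safe default.
  return "high";
}
```

Walking the list in cost order means a simple query never reaches a more expensive tier unless the cheaper ceilings are exceeded.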
Mid-tier routes balance coding quality, latency, and single-GPU deployment constraints.
Qwen3.6-27B
mid tier · dense
Dense 27B model with a reported 77.2% on SWE-bench Verified, positioned as flagship coding performance on single-GPU hardware. Native Thinking Preservation retains chain-of-thought across turns. An Unsloth GGUF variant is available for local deployment via llama.cpp or LM Studio, with tool calling supported and no API dependency.
VRAM: ~16.8GB (Q4)
Context: 262,144 tokens
Qwen3-35B-A3B-int4
mid tier · moe
Mixture-of-experts model suited to low-latency UI paths: only a small slice of its parameters (about 3B) is active per token.
VRAM: ~21GB (Q4)
Context: 32,768 tokens
Qwen3.6-27B Routing Rule
`task.type === "agent" || task.complexity === "medium-high"`
A dense architecture combined with Thinking Preservation gives stable multi-turn reasoning without MoE expert-routing overhead.
Local fallback: when the external API is unavailable, route to the Qwen3.6-27B GGUF build via llama.cpp.
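The rule and the fallback can be combined into a single routing function, sketched below. The predicate is the expression given above. Everything else is an assumption for illustration: the `Task` shape, the `apiAvailable` flag, the `defaultModel` parameter, the label strings, and the choice to send all traffic to the local build while the API is down.

```typescript
// Assumed task shape. Only `type` and `complexity` appear in the rule above.
interface Task {
  type: string;
  complexity: string;
}

// Label strings are illustrative, not identifiers from the router config.
const DENSE_MODEL = "Qwen3.6-27B";
const LOCAL_FALLBACK = "Qwen3.6-27B GGUF (llama.cpp)";

// The routing rule as written on this page.
function matchesDenseRule(task: Task): boolean {
  return task.type === "agent" || task.complexity === "medium-high";
}

// Pick a model for the task. `apiAvailable` and `defaultModel` are hypothetical inputs.
function route(task: Task, apiAvailable: boolean, defaultModel: string): string {
  if (!apiAvailable) {
    // Assumption: every request falls back to the local GGUF build during an outage.
    return LOCAL_FALLBACK;
  }
  return matchesDenseRule(task) ? DENSE_MODEL : defaultModel;
}
```

For example, `route({ type: "agent", complexity: "low" }, true, "some-default")` selects the dense model because the task type matches, while the same call with `apiAvailable` set to `false` returns the local fallback label.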
| Dense vs MoE | 27B Dense | 35B-A3B MoE |
|---|---|---|
| Active params | 27B | 3B |
| SWE-bench | 77.2% | ~72% |
| Inference speed | Moderate | Fast |
| VRAM (Q4) | ~16.8GB | ~21GB |
| Deployment | llama.cpp / LM Studio (GGUF) | API/Cloud |
| Best for | Agent loops | Low-latency UI |