Compare FP32 vs INT8 inference — real benchmarks in your browser
Local-First Inference Posture
Privacy: Prompt text is processed in-browser for this demo after model download.
Efficiency: INT8 stores each weight in 1 byte instead of FP32's 4, cutting weight memory roughly 4× and typically improving responsiveness on constrained hardware.
Trade-off: Local models improve privacy/cost posture, while frontier server models remain useful for harder tasks.
Run real inference benchmarks comparing FP32 and INT8 quantized models in your browser.
No API key required. Models are downloaded on first use.
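To make the FP32 vs INT8 memory difference concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not the quantization pipeline this demo uses; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize_int8(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [scale * v for v in q]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
fp32 = array("f", weights)            # 4 bytes per weight
q, scale = quantize_int8(weights)     # 1 byte per weight, plus one shared scale
restored = dequantize_int8(q, scale)

fp32_bytes = len(fp32) * fp32.itemsize
int8_bytes = len(q) * q.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The stored weights shrink by 4×, and the rounding error per weight is bounded by half a quantization step (`scale / 2`). Small weights such as `0.003` can collapse to zero, which is the accuracy cost the benchmark is meant to surface.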
Real-world MoE quantization example · Intel AutoRound on Qwen3 architecture
Architecture: MoE (Mixture of Experts)
Total / Active Params: 35B total / ~3B active per token
Quantization Method: AutoRound (group_size=128)
Precision: INT4 (Intel AutoRound)
Key Trade-off
MoE INT4 = enterprise-grade reasoning at edge compute cost. Only the active experts' weights are used to compute each token, which is fundamentally different from quantizing a dense model, where every weight participates in every token.
vs. Dense INT8
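Dense INT8 compresses all weights uniformly to 8 bits, and all of them are used for every token. MoE INT4 activates only ~3B of 35B weights per token, so per-token compute scales with the active experts, not the total parameter count. Memory footprint still scales with total parameters, because every expert generally has to stay available for routing.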
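The arithmetic behind this comparison can be sketched as follows. The sketch counts weight storage only (no KV cache, activations, or quantization metadata) and assumes all experts stay resident in memory; the helper name `weight_memory_gb` is hypothetical, and the parameter counts are the rounded figures quoted above.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring all other memory use."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9    # all experts, from the figures above
ACTIVE_PARAMS = 3e9    # approximate weights used per token

# Memory scales with TOTAL parameters (assumes every expert is kept resident).
resident_fp32_gb = weight_memory_gb(TOTAL_PARAMS, 32)
resident_int8_gb = weight_memory_gb(TOTAL_PARAMS, 8)
resident_int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)

# Per-token compute and weight reads scale with ACTIVE parameters.
touched_int4_gb = weight_memory_gb(ACTIVE_PARAMS, 4)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"FP32 resident: {resident_fp32_gb:.1f} GB")
print(f"INT8 resident: {resident_int8_gb:.1f} GB")
print(f"INT4 resident: {resident_int4_gb:.1f} GB")
print(f"INT4 weights touched per token: {touched_int4_gb:.1f} GB "
      f"({active_fraction:.1%} of parameters)")
```

Under these assumptions, quantization determines how much memory the model occupies, while expert sparsity determines how much of it is read and multiplied per token.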
Deployment Profile
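Single-GPU capable (served via vLLM). With group_size=128, each group of 128 weights gets its own quantization scale, which limits quantization error while achieving ~8× compression vs FP32 (32-bit to 4-bit weights, before per-group scale overhead).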
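A rough sketch of why group size matters, under stated assumptions: each group is assumed to store one 16-bit scale and one 4-bit zero point (actual export formats may differ), and the quantizer below uses plain round-to-nearest, not AutoRound's tuned rounding. All function names and the sample weights are hypothetical.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=4):
    """Average stored bits per weight, including assumed per-group metadata."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def quantize_groups_int4(weights, group_size):
    """Round-to-nearest symmetric 4-bit quantization with one scale per group.

    Returns the dequantized values so the error can be measured directly.
    """
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        out.extend(scale * max(-7, min(7, round(w / scale))) for w in group)
    return out


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Storage: 4-bit weights plus assumed per-group overhead at group_size=128.
bits = effective_bits_per_weight(4, 128)
compression_vs_fp32 = 32 / bits

# Accuracy: small and large weights sharing one scale lose the small ones.
weights = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]
err_one_group = mean_abs_error(weights, quantize_groups_int4(weights, 8))
err_two_groups = mean_abs_error(weights, quantize_groups_int4(weights, 4))

print(f"effective bits/weight: {bits:.5f}, compression vs FP32: {compression_vs_fp32:.2f}x")
print(f"mean abs error, one group: {err_one_group:.4f}, two groups: {err_two_groups:.4f}")
```

Smaller groups track the local weight range more closely but add more metadata per weight, so the effective compression at group_size=128 lands slightly below the nominal 8× under these assumptions.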