Model Quantization Demo

Compare FP32 vs INT8 inference — real benchmarks in your browser

Runs on device · Privacy-preserving local inference

Local-First Inference Posture

Privacy: Once the model has downloaded, this demo processes prompt text in the browser.

Efficiency: INT8 stores each weight in 1 byte instead of the 4 bytes FP32 needs, which cuts weight memory roughly 4× and can improve responsiveness on constrained hardware.
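
To make the INT8 memory saving concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only: the scheme this demo's models actually use is not stated on the page, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor (assumed scheme, for illustration).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical sample weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Storage per weight: 4 bytes in FP32, 1 byte in INT8 (plus one shared scale).
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why small values such as 0.003 collapse to zero when one large weight dominates the tensor.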

Trade-off: Local models improve privacy/cost posture, while frontier server models remain useful for harder tasks.

Run real inference benchmarks comparing FP32 and INT8 quantized models in your browser.

No API key required. Models are downloaded on first use.

Real-world Quantization Reference

Qwen3.6-35B-A3B — Intel AutoRound

Real-world MoE quantization example · Intel AutoRound on Qwen3 architecture

int4

Architecture: MoE (Mixture of Experts)

Total / Active Params: 35B / ~3B active per token

Quantization Method: AutoRound (group_size=128)

Precision: INT4 (Intel AutoRound)

Key Trade-off

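MoE INT4 aims for large-model reasoning quality at a much lower per-token compute cost. Only the active experts' weights take part in each token's computation, which makes it fundamentally different from dense model quantization. The full 35B parameters still have to be stored.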

vs. Dense INT8

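Dense INT8 compresses all weights uniformly, and every weight is used for every token. MoE INT4 activates only ~3B of 35B weights per token, so per-token compute scales with the active experts rather than total parameters, while weight storage still scales with the total.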
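
The comparison can be checked with back-of-the-envelope arithmetic. The sketch below uses the 35B total and ~3B active figures from this page and compares against a hypothetical dense model of the same size. It counts weight bits only and ignores quantization metadata, activations, and the KV cache, so the numbers are rough estimates rather than measurements.

```python
def weight_bytes(params, bits_per_weight):
    """Bytes needed to store `params` weights at the given precision."""
    return params * bits_per_weight / 8


TOTAL_PARAMS = 35e9   # total parameters, from the card above
ACTIVE_PARAMS = 3e9   # approximate active parameters per token

# Storage scales with TOTAL parameters, whatever the architecture.
fp32_gb = weight_bytes(TOTAL_PARAMS, 32) / 1e9        # 140.0
dense_int8_gb = weight_bytes(TOTAL_PARAMS, 8) / 1e9   # 35.0 (hypothetical dense 35B)
moe_int4_gb = weight_bytes(TOTAL_PARAMS, 4) / 1e9     # 17.5

# Per-token compute scales with ACTIVE parameters in an MoE model.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS        # roughly 0.086
```

Under these assumptions the INT4 weights take about 17.5 GB, and each token touches under a tenth of them.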

Deployment Profile

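Single-GPU capable (served via vLLM). AutoRound with group_size=128 keeps separate quantization parameters for each group of 128 weights, which helps preserve the weight distribution while achieving up to ~8× compression vs FP32 (32 bits down to 4, before per-group metadata).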
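
To show what group_size=128 means in practice, the sketch below quantizes weights to 4-bit codes in groups of 128 using plain round-to-nearest. This is not AutoRound itself, which tunes the rounding rather than rounding to nearest. The storage layout (a 16-bit scale and a 16-bit minimum per group) is an assumption for illustration, and all function names are hypothetical.

```python
def quantize_groups_int4(weights, group_size=128):
    """Round-to-nearest INT4 with one (scale, minimum) pair per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15 if hi > lo else 1.0  # 16 levels: codes 0..15
        codes = [min(15, max(0, round((w - lo) / scale))) for w in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groups(groups):
    """Rebuild approximate float weights from the per-group codes."""
    out = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out


def effective_bits(weight_bits=4, group_size=128, metadata_bits=32):
    """Bits per weight including per-group metadata (assumed 2 x 16 bits)."""
    return weight_bits + metadata_bits / group_size


# Deterministic synthetic weights in [-1, 1]; 256 values form two groups.
weights = [((i * 37) % 101 - 50) / 50 for i in range(256)]
groups = quantize_groups_int4(weights)
restored = dequantize_groups(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

bits = effective_bits()   # 4.25 under the assumed layout
ratio = 32 / bits         # a little under the ideal 32 / 4 = 8
```

Smaller groups track the local weight range more closely but add more metadata per weight, which is the trade-off a group size of 128 balances.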