Quantization Explorer

Bit Precision vs Cycles vs Accuracy • The Fundamental Tradeoff
Bit Precision
  32-bit Float   (full precision, FP32):    ~24 cyc/mul
  16-bit Fixed   (Q8.8 format):             ~12 cyc/mul
  8-bit Integer  (INT8 quantized):           ~6 cyc/mul
  4-bit Integer  (extreme quantization):     ~3 cyc/mul
  2-bit Ternary  (-1, 0, +1 only):           ~1 cyc/mul
Example: quantizing 0.7350 at 16 bits (Q8.8)
  Original (FP32):    0.7350   (hex 0x3F3C28F6)
  Quantized (Q8.8):   0.7344   (hex 0x00BC)
  Error:              0.08%
  Levels available:   65,536
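For concreteness, here is a minimal Python sketch of the Q8.8 conversion shown above. The helper names and the rounding convention are my assumptions, not the explorer's actual code:

```python
# Sketch: quantize a single FP32 value to Q8.8 (16-bit fixed point with
# 8 fractional bits). Helper names and rounding convention are assumptions.
import struct

def quantize_q8_8(x: float) -> int:
    """Round x to the nearest Q8.8 value, returned as a 16-bit word."""
    return int(round(x * 256)) & 0xFFFF

def dequantize_q8_8(q: int) -> float:
    """Interpret a 16-bit word as signed Q8.8 and convert back to float."""
    if q >= 0x8000:
        q -= 0x10000
    return q / 256.0

x = 0.7350
q = quantize_q8_8(x)                    # 0x00BC (188)
x_hat = dequantize_q8_8(q)              # 0.734375, displayed as 0.7344

fp32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
print(f"original  {x:.4f}  hex 0x{fp32_bits:08X}")      # 0x3F3C28F6
print(f"quantized {x_hat:.4f}  hex 0x{q:04X}")          # 0x00BC
print(f"error     {abs(x - x_hat) / x * 100:.3f}%")     # 0.085%, i.e. the ~0.08% shown above
```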
[Visualization: quantization levels on the number line [-1, 1]; quantization error at this setting: 0.08%]
  Precision   Format           Levels    Cycles/mul   Max Error
  32-bit      Full Precision   ~10⁹      24           ~0%
  16-bit      Q8.8 Fixed       65,536    12           0.2%
  8-bit       INT8             256       6            0.4%
  4-bit       INT4             16        3            6.25%
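The "Max Error" column follows directly from each format's step size. A short sketch, assuming worst-case error of half a quantization step on a unit-scale value:

```python
# Where the table's max-error figures come from (assumption: worst case is
# half a quantization step, expressed as a percentage of a unit-scale value).
formats = {
    "Q8.8 fixed": 2 ** -8,      # 8 fractional bits -> step of 1/256
    "INT8":       2 / 2 ** 8,   # 256 levels spread across [-1, 1]
    "INT4":       2 / 2 ** 4,   # 16 levels spread across [-1, 1]
}

for name, step in formats.items():
    max_error = step / 2        # a value is at most half a step from a level
    print(f"{name:10s} step={step:.6f}  max error ≈ {max_error * 100:.2f}%")
# Q8.8 -> 0.20%, INT8 -> 0.39% (~0.4%), INT4 -> 6.25%
```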
Multiplication Example: weight × activation
  0.735 × 0.892 = 0.656
  Exact result (FP32):       0.655620
  Quantized result (Q8.8):   0.654297
  Error:                     0.20%
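Here is a minimal sketch of the Q8.8 multiply behind this example. This is my own illustration, and the explorer's exact rounding convention may differ, so the last digits of the quantized result can vary slightly:

```python
# Q8.8 multiply: the integer product of two Q8.8 values carries 16
# fractional bits, so a right shift by 8 brings it back to Q8.8.
def to_q8_8(x: float) -> int:
    return int(round(x * 256))

w   = to_q8_8(0.735)          # 188, i.e. 0.734375
act = to_q8_8(0.892)          # 228, i.e. 0.890625

full = w * act                # Q16.16 product: 42864 -> 0.654053
q    = (full + 128) >> 8      # rounded back to Q8.8: 167 -> 0.652344

print(full / 65536, q / 256)  # both near the exact FP32 result 0.655620
```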
Performance vs Accuracy Tradeoff (MLP Forward Pass)
  FP32:     990 cyc   0.00% error
  Q8.8:     495 cyc   0.15% error
  INT8:     248 cyc   0.40% error
  INT4:     124 cyc   2.50% error
  Ternary:   41 cyc   8.00% error
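To see the accuracy side of this tradeoff outside the explorer, here is a rough sketch comparing an FP32 forward pass against the same tiny MLP with INT8-quantized weights. The layer sizes, random data, and helper names are placeholders, not the explorer's actual network:

```python
# Sketch: FP32 vs INT8-weight forward pass on a made-up two-layer MLP.
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: int8 weights plus a scale."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def forward_fp32(x, w1, w2):
    h = np.maximum(x @ w1, 0.0)          # ReLU hidden layer
    return h @ w2

def forward_int8(x, q1, s1, q2, s2):
    # Dequantize-then-multiply keeps the sketch simple; a real integer
    # kernel would accumulate in int32 and rescale once at the end.
    h = np.maximum(x @ (q1.astype(np.float32) * s1), 0.0)
    return h @ (q2.astype(np.float32) * s2)

w1 = rng.normal(scale=0.5, size=(4, 8)).astype(np.float32)
w2 = rng.normal(scale=0.5, size=(8, 2)).astype(np.float32)
x  = rng.normal(size=(1, 4)).astype(np.float32)

q1, s1 = quantize_int8(w1)
q2, s2 = quantize_int8(w2)

ref = forward_fp32(x, w1, w2)
out = forward_int8(x, q1, s1, q2, s2)
print("relative error:", np.abs(out - ref).max() / np.abs(ref).max())
```

With per-tensor scales like this, the deviation for such a small network is typically well under a percent, in line with the ~0.4% figure above.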
💡 Why Quantization Matters
Modern LLMs use aggressive quantization to run on consumer hardware.

GGUF Q4_K_M (popular for local LLMs): 4-bit weights with some 6-bit outliers
AWQ/GPTQ: 4-bit with calibration to minimize accuracy loss
1-bit LLMs (BitNet): Microsoft research shows ternary weights can match FP16 quality at scale

At 150 Hz, going from 16-bit to 4-bit cuts our MLP inference from 3.3 seconds to 0.8 seconds—a 4× speedup. The tradeoff is ~2.5% accuracy loss, which often matters less than you'd expect in practice.
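The wall-clock figures are just the cycle counts from the table above divided by the clock rate, e.g.:

```python
# 3.3 s and 0.8 s come from dividing cycle counts by the 150 Hz clock.
CLOCK_HZ = 150
for name, cycles in [("Q8.8 (16-bit)", 495), ("INT4 (4-bit)", 124)]:
    print(f"{name}: {cycles / CLOCK_HZ:.1f} s per forward pass")
# Q8.8 (16-bit): 3.3 s per forward pass
# INT4 (4-bit): 0.8 s per forward pass
```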

This is active research territory. Finding the minimum precision that preserves model quality is one of the most important problems in efficient AI deployment.