Bit Precision
• 32-bit Float: full precision (FP32), ~24 cyc/mul
• 16-bit Fixed: Q8.8 format, ~12 cyc/mul
• 8-bit Integer: INT8 quantized, ~6 cyc/mul
• 4-bit Integer: extreme quantization, ~3 cyc/mul
• 2-bit (Ternary): weights restricted to -1, 0, +1, ~1 cyc/mul
Quantization example (16-bit Q8.8)
• Original value (FP32): 0.7350 (hex 0x3F3C28F6)
• Quantized value: 0.7344 (hex 0x00BC)
• Error: 0.08%
• Levels available: 65,536
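For concreteness, here is a small Python sketch (not from the original) of that conversion; the round-to-nearest step is an assumption about how the quantizer behaves.

```python
import struct

def quantize_q8_8(x: float) -> int:
    """Round a real value to the nearest Q8.8 level (levels are 1/256 apart)."""
    return int(round(x * 256))

def dequantize_q8_8(q: int) -> float:
    """Map a Q8.8 integer back to the real value it represents."""
    return q / 256.0

x = 0.7350
q = quantize_q8_8(x)                 # 188 == 0x00BC
xq = dequantize_q8_8(q)              # 0.734375 (prints as 0.7344)
rel_err = abs(x - xq) / abs(x)       # ~0.085%, roughly the 0.08% quoted above

fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
print(f"FP32 bits : 0x{fp32_bits:08X}")                       # 0x3F3C28F6
print(f"Q8.8      : 0x{q:04X} -> {xq:.4f} (error {rel_err:.2%})")
```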
[Figure: quantization levels on the number line [-1, 1] and the resulting quantization error]
• 32-bit (Full Precision): ~10⁹ levels, 24 cyc/mul, max error ~0%
• 16-bit (Q8.8 Fixed): 65,536 levels, 12 cyc/mul, max error 0.2%
• 8-bit (INT8): 256 levels, 6 cyc/mul, max error 0.4%
• 4-bit (INT4): 16 levels, 3 cyc/mul, max error 6.25%
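Those worst-case figures follow from the step size of each format. A quick Python check, assuming round-to-nearest and (for the INT formats) levels spread evenly across [-1, 1]:

```python
# Worst-case error with round-to-nearest is half of one quantization step.
# Q8.8 has a fixed step of 1/256; the INT formats are assumed to spread
# their levels evenly over [-1, 1] (an assumption, not stated above).
formats = {
    "Q8.8 (16-bit fixed)": 1 / 256,
    "INT8 over [-1, 1]":   2 / 256,
    "INT4 over [-1, 1]":   2 / 16,
}
for name, step in formats.items():
    print(f"{name}: max error {step / 2:.2%}")   # 0.20%, 0.39%, 6.25%
```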
Multiplication Example: weight × activation
• 0.735 × 0.892 = 0.656
• Exact result (FP32): 0.655620
• Quantized result: 0.654297
• Error: 0.20%
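A minimal Python sketch of a Q8.8 multiply: both operands are rounded to 1/256 steps and the raw product is shifted back down by 8 bits. The shift below truncates, so its result differs slightly from the rounded figure quoted above; the exact rounding scheme used there is an assumption.

```python
def to_q8_8(x: float) -> int:
    """Round a real value to the nearest Q8.8 integer (scale factor 256)."""
    return int(round(x * 256))

def q8_8_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 numbers: the raw product is Q16.16, so shift right by 8."""
    return (a * b) >> 8

w, x = 0.735, 0.892
qw, qx = to_q8_8(w), to_q8_8(x)        # 188, 228
qy = q8_8_mul(qw, qx)                  # (188 * 228) >> 8 = 167
print(f"exact (FP32): {w * x:.6f}")    # 0.655620
print(f"Q8.8 result : {qy / 256:.6f}") # 0.652344 with this truncating shift
```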
Performance vs Accuracy Tradeoff (MLP Forward Pass)
FP32
990 cyc
0.00%
Q8.8
495 cyc
0.15%
INT8
248 cyc
0.40%
INT4
124 cyc
2.50%
Ternary
41 cyc
8.00%
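These totals are consistent with a simple cost model: total cycles ≈ number of multiplies × cycles per multiply. The multiply count below (~41) is reverse-engineered from the table, not stated in the original.

```python
N_MULS = 41        # assumed multiply count; 41 × 24 ≈ 990, matching the FP32 row
CLOCK_HZ = 150     # the 150 Hz clock discussed in the next section

cyc_per_mul = {"FP32": 24, "Q8.8": 12, "INT8": 6, "INT4": 3, "Ternary": 1}
for fmt, cyc in cyc_per_mul.items():
    total = N_MULS * cyc
    print(f"{fmt:7s}: ~{total:3d} cyc (~{total / CLOCK_HZ:.1f} s at {CLOCK_HZ} Hz)")
```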
💡 Why Quantization Matters
Modern LLMs use aggressive quantization to run on consumer hardware.
• GGUF Q4_K_M (popular for local LLMs): 4-bit weights with some 6-bit outliers
• AWQ/GPTQ: 4-bit with calibration to minimize accuracy loss
• 1-bit LLMs (BitNet): Microsoft research shows ternary weights can match FP16 quality at scale
At 150 Hz, going from 16-bit to 4-bit cuts our MLP inference from 3.3 seconds to 0.8 seconds—a 4× speedup. The tradeoff is ~2.5% accuracy loss, which often matters less than you'd expect in practice.
This is active research territory. Finding the minimum precision that preserves model quality is one of the most important problems in efficient AI deployment.