Shrinking a neural network's weights down to lower-precision numbers (float16 → int8 → int4) so the model runs on cheaper hardware with minimal quality loss. Essential for on-device LLMs.
"Quantized the 70B model down to 4-bit. Runs on my MacBook."