High-performance low-bit layer-wise weight-only quantization on a laptop

Intel researchers have developed a post-training quantization method that shrinks Large Language Models (LLMs) for low-latency inference. Their Layer-wise Low-bit Weight-Only Quantization technique processes the model one layer at a time, so only a single layer's full-precision weights need to be resident in memory at once, which is what lets users quantize LLMs on memory-constrained devices like laptops. The method, which can reduce the memory footprint of LLMs by 80-90%, has been integrated into Intel's Neural Compressor tool.
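The real recipe lives in Neural Compressor, but the core idea can be sketched in plain PyTorch. The snippet below is an illustrative sketch, not Intel's implementation: the function names are ours, it uses simple round-to-nearest (RTN) 4-bit weight-only quantization, and it visits one `nn.Linear` layer at a time. A production layer-wise pipeline would additionally stream each layer's weights from disk and store the packed low-bit tensors, which is where the memory savings come from; here we dequantize in place just so the model stays runnable.

```python
# Illustrative sketch of layer-wise low-bit weight-only quantization
# (round-to-nearest, symmetric, per-output-channel). Not the Neural
# Compressor API; all names below are hypothetical.
import torch
import torch.nn as nn


def rtn_quantize(weight: torch.Tensor, bits: int = 4):
    """Quantize a 2-D weight to `bits` with per-output-channel scales."""
    qmax = 2 ** (bits - 1) - 1                                # 7 for int4
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                            # int8 holds int4 range


@torch.no_grad()
def quantize_layerwise(model: nn.Module, bits: int = 4) -> nn.Module:
    """Quantize Linear layers one at a time, so peak extra memory is
    bounded by a single layer rather than the whole model."""
    for _, module in model.named_modules():
        if isinstance(module, nn.Linear):
            q, scale = rtn_quantize(module.weight.data, bits)
            # Dequantize in place for the demo; a real engine keeps the
            # packed low-bit weights and fuses dequant into the matmul.
            module.weight.data = q.float() * scale
    return model


# Usage on a toy stand-in for an LLM block:
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
quantize_layerwise(model, bits=4)
```

Per-output-channel scales keep RTN error tolerable at 4 bits; methods such as GPTQ or AWQ refine this further, but the layer-at-a-time traversal is the piece that makes quantization feasible on a laptop.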
Read more at Medium…