Recent advancements in Large Language Models (LLMs) have demonstrated their remarkable problem-solving capabilities in various fields. These models contain an extensive number of parameters and are trained on vast text datasets.
Studies reveal that memory bandwidth, rather than compute, is the primary bottleneck for generative tasks in LLM inference: in these memory-bound settings, the speed at which parameters can be loaded from and stored to memory becomes the key latency barrier. Because progress in memory bandwidth has not kept pace with progress in compute, this gap has come to be known as the Memory Wall.
Quantization, a promising method, involves storing model parameters at lower precision than the 16 or 32 bits typically used during training. Despite recent developments like LLaMA and its instruction-following variants, achieving good quantization performance remains challenging, especially at lower bit precision and with relatively small models.
A new study from UC Berkeley addresses this issue by thoroughly examining low-bit precision quantization and identifying the limitations of current methods. Based on their findings, the researchers introduce SqueezeLLM, a post-training quantization framework that combines a Dense-and-Sparse decomposition technique with a sensitivity-based non-uniform quantization strategy. These methods enable quantization with ultra-low-bit precision while maintaining competitive model performance, significantly reducing model sizes and inference time costs. For example, their method reduces the perplexity of the LLaMA-7B model at 3-bit precision from 28.26 with uniform quantization to 7.75 on the C4 dataset, a considerable improvement.
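To make the idea of sensitivity-based non-uniform quantization concrete, here is a minimal, hypothetical sketch in Python. It is not the authors' implementation: it simply places the 2^bits quantization levels with a weighted k-means, using squared gradients as a stand-in for per-weight sensitivity, so that levels cluster where the important weights are rather than being evenly spaced.

```python
# Hypothetical sketch of sensitivity-based non-uniform quantization.
# Assumption: squared gradients approximate each weight's sensitivity
# (a proxy for Fisher information); the paper's exact metric may differ.
import numpy as np
from sklearn.cluster import KMeans

def nonuniform_quantize(weights, grads, bits=3):
    w = weights.reshape(-1, 1)
    sensitivity = grads.reshape(-1) ** 2 + 1e-12      # per-weight importance
    kmeans = KMeans(n_clusters=2 ** bits, n_init=10, random_state=0)
    kmeans.fit(w, sample_weight=sensitivity)          # sensitivity-weighted centroids
    codes = kmeans.predict(w)                         # low-bit index per weight
    centroids = kmeans.cluster_centers_.reshape(-1)   # lookup table of levels
    return codes.reshape(weights.shape), centroids

# Toy example standing in for one weight matrix and its gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
G = rng.normal(size=(256, 256)).astype(np.float32)
codes, centroids = nonuniform_quantize(W, G, bits=3)
W_hat = centroids[codes]                              # dequantized weights
print("max abs error:", np.abs(W - W_hat).max())
```

At inference time only the small centroid table and the low-bit codes need to be stored, which is where the memory savings come from.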
In extensive testing on the C4 and WikiText2 benchmarks, SqueezeLLM consistently outperforms existing quantization approaches across different bit precisions for language modeling tasks using LLaMA-7B, 13B, and 30B.
According to the research team, low-bit precision quantization of many LLMs is particularly challenging due to outliers in the weight matrices. These outliers also hurt their non-uniform quantization approach by biasing the allocation of quantization levels toward extremely high or low values. To address this, they propose a simple method that divides the model weights into dense and sparse components. By isolating the extreme values, the remaining dense region spans a narrower range and can be quantized more precisely. The sparse component can be stored efficiently in formats such as Compressed Sparse Row (CSR), and because its computation can run in parallel with the dense part, the overhead stays low.
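The following is a minimal sketch of such a dense-and-sparse split, under the assumption that the top 0.5% of weights by magnitude are treated as outliers (the actual threshold used in the paper may differ). The outliers are kept at full precision in a CSR matrix, while the remaining narrow-range weights are the ones handed to the low-bit quantizer.

```python
# Minimal sketch of a dense-and-sparse weight decomposition.
# Assumption: outlier_fraction=0.005 is an illustrative choice,
# not necessarily the threshold used by SqueezeLLM.
import numpy as np
from scipy.sparse import csr_matrix

def dense_sparse_split(W, outlier_fraction=0.005):
    threshold = np.quantile(np.abs(W), 1.0 - outlier_fraction)
    outlier_mask = np.abs(W) > threshold
    sparse_part = csr_matrix(W * outlier_mask)   # few extreme values, kept at full precision
    dense_part = W * ~outlier_mask               # narrow-range weights, to be quantized to low bits
    return dense_part, sparse_part

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)
dense, sparse = dense_sparse_split(W)
print("outliers stored sparsely:", sparse.nnz, "of", W.size)
# At inference, the layer output is dense_matmul(x) + sparse_matmul(x),
# so the sparse term can be computed alongside the dense low-bit kernel.
```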
To demonstrate the potential of their framework, the team applied SqueezeLLM to the Vicuna-7B and 13B models, comparing them with GPTQ and AWQ, two state-of-the-art approaches, using the MMLU dataset and GPT-4 to assess the quality of the generated output. In both benchmarks, SqueezeLLM consistently outperformed the other approaches, with the 4-bit quantized model performing as well as the baseline.
The researchers also showcased significant latency reductions when running their quantized models on A6000 GPUs, achieving speedups of up to 2.3x over baseline FP16 inference for LLaMA-7B and 13B, as well as up to 4x lower latency than GPTQ, demonstrating the method's gains in both quantization quality and inference efficiency.
For more details, you can read the [paper](https://ift.tt/O4kNSb7) and visit the [GitHub](https://ift.tt/2IedMSD) page of SqueezeLLM. Don’t forget to join our 24k+ ML SubReddit, Discord Channel, and Email Newsletter to stay updated on the latest AI research news, cool projects, and more. If you have any questions or want to provide feedback, feel free to email us at Asif@marktechpost.com.
### Featured Tools From AI Tools Club
– [Check Out 100’s AI Tools in AI Tools Club](https://pxl.to/ydl0hc)
### Editor Notes
In this article, we explored the advancements in low-bit precision quantization for Large Language Models (LLMs) and how SqueezeLLM offers a promising solution to improve quantization performance and inference efficiency. The researchers at UC Berkeley have made significant progress in reducing model sizes and latency while maintaining competitive model performance. This research opens up new possibilities for deploying LLMs in resource-constrained environments without sacrificing performance. It will be exciting to see how these developments contribute to the field of natural language processing and enable the widespread adoption of LLMs. For more AI news and updates, make sure to visit [GPT News Room](https://gptnewsroom.com).
from GPT News Room https://ift.tt/Z2xB1EY