VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Abstract
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit widths (even down to 2 bits). This reduces memory requirements, lowers storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement on QA tasks of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3. VPTQ uses only 10.4-18.6% of the quantization algorithm execution time and delivers a 1.6-1.8× increase in inference throughput compared to SOTA.
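To make "compressing vectors into indices using lookup tables" concrete, here is a minimal sketch of plain vector quantization. It is not the VPTQ algorithm: it fits the codebook with ordinary k-means and uses none of the second-order optimization, residual, or outlier handling described above. The function names, the vector length of 2, and the codebook size of 16 are illustrative assumptions chosen so that each weight costs 2 index bits.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda j: squared_distance(vector, codebook[j]))


def fit_codebook(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumption; VPTQ uses its own initialization)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def index_bits_per_weight(vec_len, k):
    """Bits spent on indices per weight; codebook storage is ignored."""
    return math.log2(k) / vec_len


# Toy "weight matrix": 1024 Gaussian values, flattened.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]

# Group weights into vectors of length 2 (length must divide evenly).
vec_len, k = 2, 16
vectors = [weights[i:i + vec_len] for i in range(0, len(weights), vec_len)]

codebook = fit_codebook(vectors, k)                 # the lookup table
indices = [nearest(v, codebook) for v in vectors]   # what gets stored

# Dequantization is a table lookup followed by flattening.
restored = [x for i in indices for x in codebook[i]]
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)

print(f"index bits per weight: {index_bits_per_weight(vec_len, k)}")
print(f"reconstruction MSE: {mse:.4f}")
```

Only `indices` and `codebook` need to be kept. The index cost per weight is log2(k) divided by the vector length, so longer vectors with larger codebooks can reach the same bit width with more expressive centroids, at the price of a bigger lookup table.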
Community
VPTQ (Vector Post-Training Quantization) is an advanced compression technique that dramatically reduces the size of large language models such as the 70B and 405B Llama models. VPTQ efficiently compresses these models to 1-2 bits within just a few hours, enabling them to run effectively on GPUs with limited memory.
- paper https://arxiv.org/abs/2409.17066
- github https://github.com/microsoft/VPTQ
- Hugging Face community https://huggingface.co/VPTQ-community
- free Hugging Face online demo/space https://huggingface.co/spaces/VPTQ-community/VPTQ-Demo
Llama 3.1 70b chat on RTX4090 (24G @ 2bit)
Llama 3.1 70b prompt on RTX4090 (24G @ 2bit)
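As a rough check on why a 70B model at 2 bits can fit on a 24 GB card, the arithmetic below estimates raw weight storage only. It is a back-of-the-envelope sketch: the parameter counts are the nominal 70e9 and 405e9 (an assumption), and it ignores codebooks, any layers kept at higher precision, the KV cache, and activations, all of which add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Raw weight storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumed, not exact model sizes).
for name, params in [("70B", 70e9), ("405B", 405e9)]:
    for bits in (16, 2):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: {gb:.2f} GB")
```

Under these assumptions the 70B weights drop from 140 GB at 16-bit to 17.5 GB at 2-bit, which leaves some headroom within 24 GB for everything the estimate leaves out.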
in the tables, for example table 2, you have highlighted the best values where VPTQ beats other quantization methods, but you did not highlight the highest values where other methods were better. It would be a lot better if you'd highlight the highest values everywhere instead of giving VPTQ preferential treatment by only highlighting the highest values if they are from your method :)
also just a small thing on the side for clarity, maybe changing unit descriptions from something like mem/GB, cost/h to mem (GB), cost (h) would help a bit with understandability. I was confused at first at mem/GB because i thought it meant "memory per gigabyte".
There are also some other text issues, like the duplicate sentence at the top of page 3: "Under the guidance of the optimization problem, Under the guidance of the optimization problem".
content wise though, looks like super great work!
Thanks for your suggestion. Our paper reviewer also pointed out the highlighting and the typos in the tables, and we will fix them in our camera-ready version. :-)
The current tech report is an early version that introduces our methods and early results. Thanks for your kind suggestion!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms (2024)
- P4Q: Learning to Prompt for Quantization in Visual-language Models (2024)
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models (2024)
- Foundations of Large Language Model Compression -- Part 1: Weight Quantization (2024)
- VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers (2024)