Large language models (LLMs) have significantly transformed natural language processing. However, their practical deployment is hindered by extensive memory and computation requirements. While recent post-training quantization (PTQ) methods are effective at reducing memory usage and improving computational efficiency, they often rely on manually defined quantization parameters, which leads to suboptimal performance and inadequate support for extremely low-bit quantization.
To address these challenges, we propose Omnidirectionally calibrated Quantization (OmniQuant), a technique designed to achieve strong performance across diverse quantization settings while preserving the computational efficiency of PTQ. OmniQuant does so by efficiently optimizing the quantization parameters through two key components: Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET).
LWC modulates extreme weight values by optimizing a learnable clipping threshold, shrinking the quantization range so that outlying weights no longer dominate it. LET, in turn, tackles activation outliers by shifting the quantization difficulty from activations to weights through a learnable equivalent transformation. Operating within a differentiable framework using block-wise error minimization, both components can be optimized efficiently for weight-only and weight-activation quantization.
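To make the two components concrete, the following PyTorch sketch illustrates the core ideas. It is a simplified illustration rather than OmniQuant's actual implementation: the class and parameter names (`LearnableWeightClipping`, `gamma_logit`, `beta_logit`, `equivalent_transform`, `scale`, `shift`) are hypothetical, clipping is applied per tensor rather than per group, and the block-wise optimization loop that trains these parameters against each transformer block's output error is omitted.

```python
import torch
import torch.nn as nn


class LearnableWeightClipping(nn.Module):
    """Sketch of LWC: learn how strongly to clip weights before quantization.

    Sigmoid-bounded strengths in (0, 1) scale the max/min of the weight
    tensor, shrinking the quantization range so extreme values are clipped.
    """

    def __init__(self, n_bits: int = 4):
        super().__init__()
        self.n_bits = n_bits
        # Logits initialized so sigmoid ~ 0.98, i.e. almost no clipping at first.
        self.gamma_logit = nn.Parameter(torch.tensor(4.0))  # scales max(W)
        self.beta_logit = nn.Parameter(torch.tensor(4.0))   # scales min(W)

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        w_max = torch.sigmoid(self.gamma_logit) * w.max()
        w_min = torch.sigmoid(self.beta_logit) * w.min()
        levels = 2 ** self.n_bits - 1
        step = (w_max - w_min).clamp(min=1e-8) / levels
        zero_point = torch.round(-w_min / step)
        x = w / step + zero_point
        # Straight-through estimator: round/clamp only in the forward pass,
        # so gradients still flow back to gamma_logit and beta_logit.
        x_q = (x.round().clamp(0, levels) - x).detach() + x
        return (x_q - zero_point) * step  # dequantize: simulated quantization


def equivalent_transform(x, weight, bias, scale, shift):
    """Sketch of LET for a linear layer y = x @ W + b, with W of shape [in, out].

    A per-input-channel scale and shift migrate activation outliers into the
    weights while leaving the layer's output mathematically unchanged:
        y = [(x - shift) / scale] @ [diag(scale) @ W] + (b + shift @ W).
    """
    x_t = (x - shift) / scale           # smoother activations, easier to quantize
    w_t = weight * scale.unsqueeze(-1)  # outlier difficulty folded into weights
    b_t = bias + shift @ weight
    return x_t, w_t, b_t
```

Because both the clipping strengths and the transformation parameters receive gradients (via the straight-through estimator for rounding), they can be trained jointly on a small calibration set, block by block, keeping the calibration cost close to that of standard PTQ.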
For example, with only 128 calibration samples, OmniQuant quantizes the LLaMA-2 model family (7B-70B) within 1-16 hours on a single A100-40G GPU. Extensive experiments demonstrate OmniQuant's superior performance across diverse quantization configurations such as W4A4, W6A6, W4A16, W3A16, and W2A16. Furthermore, OmniQuant proves effective on instruction-tuned models and delivers notable gains in inference speed and memory reduction on real devices.
The implementation of OmniQuant, including code and models, is available at [https://github.com/OpenGVLab/OmniQuant](https://github.com/OpenGVLab/OmniQuant).