LLM Quantization: MixLLM Optimizes Memory Footprint and System Performance

Large language models (LLMs) have made impressive progress in recent years, but their high memory requirements and enormous computational demands pose a challenge for efficient deployment. Quantization has proven to be one of the most effective methods for compressing LLMs and reducing memory footprint. This involves representing the model's weights and/or activations with lower bit widths. However, existing quantization solutions show limitations regarding accuracy loss or system efficiency.
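
As a rough illustration of what quantization means in practice, the following NumPy sketch maps a float tensor to signed integers with a single scale and measures the round-trip error at 8 and 4 bits. Real methods refine this with per-group scales, zero points, and calibration, but the underlying bit-width/accuracy trade-off is the same.

    import numpy as np

    def quantize_dequantize(x, n_bits):
        # Symmetric per-tensor quantization: one float scale, signed integer levels.
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(x).max() / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        return q * scale                                   # back to float to measure the error

    w = np.random.randn(4096, 4096).astype(np.float32)     # a toy weight matrix
    for bits in (8, 4):
        err = np.abs(quantize_dequantize(w, bits) - w).mean()
        print(f"int{bits}: mean abs error {err:.4f}")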

Efficient LLM quantization has to balance three factors: accuracy, the memory footprint of the parameters, and the system efficiency of execution. These three characteristics form what can be called the effectiveness triangle of quantization, and existing quantization solutions emphasize different corners of it, offering different trade-offs.

Weight-only methods focus on reducing the memory footprint and can speed up execution at small batch sizes, where the MatMul is bound by loading the weights. However, the accuracy loss of 4-bit quantization can be significant, especially for newer models with higher information density. Moreover, at large batch sizes the computation becomes compute-bound, and the on-the-fly dequantization of the weights can degrade performance rather than improve it.
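
A minimal sketch of the group-wise, asymmetric 4-bit weight quantization that weight-only methods typically use (the group size of 128 and the toy matrix shape are assumptions, not MixLLM specifics), together with an estimate of the memory saving versus FP16:

    import numpy as np

    def quantize_weight_int4(w, group_size=128):
        # Asymmetric 4-bit quantization along the input dimension,
        # one (scale, zero point) pair per group.
        out_f, in_f = w.shape
        g = w.reshape(out_f, in_f // group_size, group_size)
        w_min = g.min(axis=-1, keepdims=True)
        w_max = g.max(axis=-1, keepdims=True)
        scale = (w_max - w_min) / 15.0                     # 16 levels: 0..15
        zero = np.round(-w_min / scale)
        q = np.clip(np.round(g / scale) + zero, 0, 15).astype(np.uint8)
        return q, scale, zero                              # dequantize as (q - zero) * scale

    w = np.random.randn(4096, 11008).astype(np.float32)
    q, scale, zero = quantize_weight_int4(w)
    fp16_bytes = w.size * 2
    int4_bytes = q.size // 2 + (scale.size + zero.size) * 2    # two 4-bit values per byte + FP16 metadata
    print(f"~{fp16_bytes / int4_bytes:.1f}x smaller than FP16")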

Weight-activation quantization represents both weights and activations with low bit widths, which can yield higher system efficiency because the MatMul itself runs on low-bit hardware units. However, it tends to cause larger accuracy losses, since activations are generally harder to quantize than weights. In addition, the activations must be quantized at run time and the results dequantized back to float, and this overhead can eat into the efficiency gains.
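
The sketch below emulates the usual W8A8 path in NumPy: activations are quantized symmetrically per token at run time, the product is accumulated in int32 (which is what int8 Tensor Cores do in hardware), and the result is scaled back to float. The per-output-channel weight handling is a simplification; the point is that the extra quantize/dequantize steps are part of every forward pass.

    import numpy as np

    def quantize_per_row(x):
        # Dynamic symmetric int8 quantization with one scale per row.
        scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    x = np.random.randn(16, 4096).astype(np.float32)       # a batch of token activations
    w = np.random.randn(4096, 4096).astype(np.float32)     # weight matrix (output x input)
    qx, sx = quantize_per_row(x)                           # quantized at run time, per token
    qw, sw = quantize_per_row(w)                           # per output channel, done offline
    acc = qx.astype(np.int32) @ qw.T.astype(np.int32)      # int8 x int8 -> int32 accumulation
    y = acc.astype(np.float32) * sx * sw.T                 # dequantize the int32 result to float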

Outlier separation and mixed-precision techniques try to improve the accuracy of low-bit quantization by either excluding high-importance weights from quantization altogether or assigning them a larger bit width. The former hurts system efficiency, because the separated outliers end up in a sparse half-precision tensor that is slow to process. State-of-the-art mixed-precision solutions, meanwhile, do reach low-bit quantization but still show a non-negligible accuracy loss.
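
For the outlier-separation idea, here is a sketch of an LLM.int8()-style decomposition: channels whose activations exceed a threshold stay in FP16 and are multiplied separately, everything else would go through the low-bit path. The threshold value is an assumption; the relevant point is that the high-precision part is a small, irregular matmul, which is exactly what hurts system efficiency.

    import numpy as np

    def split_outliers(x, w, threshold=6.0):
        # Channels with large activation magnitude stay in FP16; the rest can be quantized.
        outlier = np.abs(x).max(axis=0) > threshold
        y_fp16 = x[:, outlier] @ w[outlier, :]             # small, irregular high-precision matmul
        y_low = x[:, ~outlier] @ w[~outlier, :]            # dense part, would run through low-bit kernels
        return y_fp16 + y_low

    # With Gaussian toy data almost no channel trips the threshold;
    # real LLM activations have heavy outlier channels, which is the whole problem.
    x = np.random.randn(16, 4096).astype(np.float32)
    w = np.random.randn(4096, 4096).astype(np.float32)
    y = split_outliers(x, w)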

MixLLM: A New Approach

MixLLM takes a different approach: mixed-precision quantization across the output features of the weight matrices, based on the observation that different output features contribute differently to the model output. MixLLM identifies the high-importance output features and assigns them a larger bit width (8 bits), while the remaining features are quantized to 4 bits. In contrast to previous approaches, which reserve a uniform number of outliers within each layer, MixLLM ranks the importance of output features globally across the whole model, based on the estimated impact on the model's output loss.
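
The sketch below illustrates the "global" part: instead of reserving a fixed outlier budget per layer, all output-feature salience scores are ranked across layers against one shared 8-bit budget. How MixLLM actually estimates each feature's salience (via the estimated effect on the output loss) is the paper's contribution; the scores here are just placeholders.

    import numpy as np

    def assign_bit_widths(salience, frac_8bit=0.10):
        # One global cut-off across all layers: the top-scoring output features get 8-bit,
        # the rest get 4-bit. Per-layer 8-bit ratios then differ with layer importance.
        all_scores = np.concatenate(list(salience.values()))
        k = max(1, int(len(all_scores) * frac_8bit))
        cutoff = np.sort(all_scores)[-k]
        return {name: s >= cutoff for name, s in salience.items()}   # True -> 8-bit feature

    # Toy scores: layer1's features matter less, so it ends up with fewer 8-bit features.
    salience = {"layer0": np.random.rand(4096), "layer1": np.random.rand(4096) * 0.3}
    masks = assign_bit_widths(salience)
    print({name: round(float(m.mean()), 3) for name, m in masks.items()})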

To achieve both high accuracy and good system efficiency, MixLLM quantizes the activations to 8 bits. Since MatMul execution tends to be constrained more by the larger weight tensor than by the smaller activation tensor, there is little need to push the activations any lower. MixLLM uses symmetric quantization for the 8-bit values and asymmetric quantization for the 4-bit values, each group-wise. This configuration is awkward to execute efficiently, so MixLLM introduces a two-stage dequantization that makes it possible to use the fast int8 Tensor Cores, together with fast integer-to-float conversion to reduce the dequantization overhead. The authors also present an optimized software pipeline for the quantized linear kernel on modern GPUs.
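
As a sketch of why the two stages matter, the NumPy emulation below runs a W4A8 linear layer group by group: inside each group everything stays in the integer domain (the zero point is folded in while the values are still integers, which is what lets int8 Tensor Cores do the multiply), and only the int32 partial sums are converted to float and scaled. Function name, group size, and the loop structure are illustrative; the real kernel fuses this into a pipelined GPU implementation with fast integer-to-float conversion.

    import numpy as np

    def w4a8_groupwise_matmul(qx, sx, qw, zw, sw, group_size=128):
        # qx: int8 activations (tokens x in), sx: per-token scales (tokens x 1)
        # qw: 4-bit weights stored in uint8 (out x in), zw/sw: per-group zero points and scales (out x groups)
        tokens, in_f = qx.shape
        y = np.zeros((tokens, qw.shape[0]), dtype=np.float32)
        for g in range(in_f // group_size):
            sl = slice(g * group_size, (g + 1) * group_size)
            wg = qw[:, sl].astype(np.int32) - zw[:, g:g + 1]   # stage 1: remove zero point, still integer
            acc = qx[:, sl].astype(np.int32) @ wg.T            # low-bit integer product, int32 accumulation
            y += acc.astype(np.float32) * sw[:, g]             # stage 2: float group scale on the partial sums
        return y * sx                                          # finally apply the per-token activation scale

Keeping the first stage in the integer domain is what allows the bulk of the arithmetic to run on the int8 units; the floating-point work is reduced to one scale-and-accumulate per group of partial sums.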

Extensive experiments show that MixLLM, with only 10% of the output features in 8-bit, outperforms existing 4-bit quantization algorithms in accuracy while achieving state-of-the-art system efficiency.

Bibliography:
https://arxiv.org/abs/2412.14590
https://arxiv.org/html/2412.14590v1
https://openreview.net/pdf/416ad21ac920e9c8d80ca9d3a917483a5781a537.pdf
https://openreview.net/forum?id=zi0XgnZlcl
https://x.com/Memoirs/status/1870115149337489463
https://proceedings.mlsys.org/paper_files/paper/2024/file/5edb57c05c81d04beb716ef1d542fe9e-Paper-Conference.pdf
https://bohrium.dp.tech/paper/arxiv/2411.16158
https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/5c805adc-b555-499f-9882-5ca35ce674b5.pdf
https://proceedings.mlsys.org/paper_files/paper/2024/file/136b9a13861308c8948cd308ccd02658-Paper-Conference.pdf
https://dai.sjtu.edu.cn/my_file/pdf/f1c6a4bb-4556-43e2-8e46-4ab38d8bfed4.pdf