TransMLA: Converting Large Language Models from GQA to MLA for Efficiency

Multi-Head Latent Attention (MLA): An Efficient Approach for Large Language Models

Modern large language models (LLMs) are often limited less by raw compute than by hardware communication bottlenecks during inference. The size of the Key-Value (KV) cache, which stores the keys and values of previously processed tokens for the attention mechanism, plays a crucial role here. Multi-Head Latent Attention (MLA) addresses this challenge by applying low-rank matrices to the key and value projections: instead of full per-head keys and values, only compressed latent KV states are cached. This significantly reduces the size of the KV cache compared to traditional Multi-Head Attention and leads to faster inference.
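
As a rough sketch of this idea, the snippet below caches only a low-rank latent vector per token instead of full keys and values. The dimensions, the matrix W_down_kv, and the cache_token helper are illustrative assumptions, not the actual MLA implementation:

```python
import torch

# Illustrative dimensions (assumed for this sketch, not taken from any model).
hidden_dim = 4096      # model hidden size
latent_dim = 512       # compressed KV latent size, latent_dim << hidden_dim

# Low-rank down-projection: instead of caching full per-head keys and values,
# only the resulting latent vector is stored for each token.
W_down_kv = torch.randn(hidden_dim, latent_dim) / hidden_dim ** 0.5

kv_cache = []  # one latent vector per generated token

def cache_token(hidden_state: torch.Tensor) -> None:
    """Compress a token's hidden state and append it to the KV cache."""
    latent = hidden_state @ W_down_kv          # shape: (latent_dim,)
    kv_cache.append(latent)

# One decoding step: only the 512-dimensional latent is cached.
cache_token(torch.randn(hidden_dim))
print(kv_cache[0].shape)  # torch.Size([512])
```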

To preserve the model's expressiveness despite this compression, MLA uses up-projection matrices that reconstruct the per-head keys and values from the cached latents at attention time. This approach trades additional computation for a reduced memory and communication load. Although MLA has demonstrated its efficiency and effectiveness in Deepseek V2/V3/R1, many large model providers continue to rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA.
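
Continuing the sketch above on the read path, hypothetical up-projection matrices W_up_k and W_up_v rebuild the per-head keys and values from the cached latents at attention time; the extra matrix multiplications are the computational price paid for not storing full K and V. All names and dimensions are assumptions for illustration:

```python
import torch

# Illustrative dimensions, consistent with the sketch above.
latent_dim = 512
num_heads, head_dim = 32, 128

# Up-projection matrices restore per-head keys and values from the latents.
W_up_k = torch.randn(latent_dim, num_heads * head_dim) / latent_dim ** 0.5
W_up_v = torch.randn(latent_dim, num_heads * head_dim) / latent_dim ** 0.5

# Suppose t latent vectors are already cached (one per past token).
t = 10
latents = torch.randn(t, latent_dim)

# At attention time, full K and V are recomputed on the fly: extra matmuls
# replace the 2 * num_heads * head_dim values per token that MHA would cache.
K = (latents @ W_up_k).view(t, num_heads, head_dim)
V = (latents @ W_up_v).view(t, num_heads, head_dim)
print(K.shape, V.shape)  # torch.Size([10, 32, 128]) each
```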

TransMLA: Bridging the Gap Between GQA and MLA

Recent research shows that GQA can always be represented by MLA with the same KV cache overhead, while the reverse does not hold. This suggests that MLA is the more flexible and potentially more powerful approach. To promote the wider adoption of MLA, TransMLA was developed: a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can be further trained to increase its expressiveness without increasing the size of the KV cache.
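
The toy example below sketches why a GQA layer can always be rewritten in MLA form without growing the KV cache: the shared GQA keys act as the latent, and the up-projection is simply a fixed replication matrix that subsequent training can relax. All dimensions and variable names are illustrative assumptions, not TransMLA's actual conversion code:

```python
import torch

# Illustrative GQA configuration (assumed, not from a specific model).
hidden_dim = 1024
num_q_heads, num_kv_heads, head_dim = 8, 2, 64
group = num_q_heads // num_kv_heads   # query heads sharing one KV head

# GQA key projection: produces num_kv_heads keys reused by each query group.
W_k_gqa = torch.randn(hidden_dim, num_kv_heads * head_dim)
h = torch.randn(hidden_dim)           # one token's hidden state

# --- GQA view: project once, then repeat each shared key across its group.
k_shared = (h @ W_k_gqa).view(num_kv_heads, head_dim)
k_gqa = k_shared.repeat_interleave(group, dim=0)        # (num_q_heads, head_dim)

# --- MLA view: the same computation as down-projection + up-projection.
# The latent is exactly the GQA key output, so the cache size is unchanged;
# the up-projection starts out as a replication matrix and can be fine-tuned.
replicate = (
    torch.eye(num_kv_heads * head_dim)
    .view(num_kv_heads, head_dim, -1)
    .repeat_interleave(group, dim=0)
    .reshape(num_q_heads * head_dim, -1)
)
latent = h @ W_k_gqa                                    # cached latent
k_mla = (latent @ replicate.T).view(num_q_heads, head_dim)

print(torch.allclose(k_gqa, k_mla))  # True: GQA is a special case of MLA
```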

The developers of TransMLA also plan to build MLA-specific inference acceleration techniques that keep the latency of converted models low, enabling more efficient distillation from Deepseek R1.

Advantages and Potential of MLA

The use of MLA offers several advantages:

- Reduced KV cache requirement: Compressing the KV states significantly reduces memory requirements, which is particularly beneficial for large models (see the arithmetic sketch after this list).
- Faster inference: The lower communication load leads to faster processing and thus faster inference.
- Flexibility and expressiveness: MLA can represent GQA but offers additional opportunities for optimization and adaptation.
- Potential for further optimizations: The development of MLA-specific acceleration techniques promises further performance improvements.
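
As a back-of-the-envelope illustration of the cache savings, the following arithmetic uses made-up dimensions (not figures from the TransMLA paper) and assumes fp16 storage:

```python
# Per-token, per-layer KV cache sizes with illustrative, assumed dimensions.
num_kv_heads, head_dim, latent_dim = 8, 128, 512
bytes_per_value = 2   # fp16

gqa_cache = 2 * num_kv_heads * head_dim * bytes_per_value   # keys and values
mla_cache = latent_dim * bytes_per_value                    # one latent vector

print(f"GQA: {gqa_cache} B, MLA latent: {mla_cache} B, "
      f"reduction: {gqa_cache / mla_cache:.1f}x")
# GQA: 4096 B, MLA latent: 1024 B, reduction: 4.0x
```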

MLA and TransMLA represent a promising approach to addressing the challenges of scaling large language models. The ability to convert existing GQA-based models into MLA models opens up new avenues for optimization and efficiency gains, and could enable the development and application of even more powerful LLMs.

Bibliography:

- https://huggingface.co/papers/2502.07864
- https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
- https://arxiv.org/abs/1706.03762
- https://medium.com/towards-data-science/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4
- https://www.semanticscholar.org/paper/Attention-is-All-you-Need-Vaswani-Shazeer/204e3073870fae3d05bcbc2f6a8e263d9b72e776
- https://medium.com/@redbeet1007/paper-review-attention-is-all-you-need-vaswani-2017-1d79b986cccf
- https://horasis.org/deepseeks-multi-head-latent-attention-method/
- https://arxiv.org/html/1706.03762v7
- https://www.researchgate.net/publication/362306578_Attention_Is_All_You_Need_to_Tell_Transformer-Based_Image_Captioning