Encoder-Decoder Models: A Resurgence in Efficiency for Smaller Language Models

In the current AI landscape, large decoder-only language models dominate. Yet despite the ongoing trend of scaling these models ever larger, an alternative architecture is returning to the spotlight: the encoder-decoder model. Particularly among small language models (SLMs), meaning models with up to one billion parameters, encoder-decoder architectures show surprising advantages in efficiency and performance, making them especially attractive for resource-constrained environments.
Efficiency Advantages on Various Platforms
A systematic analysis across hardware platforms, including GPUs, CPUs, and NPUs, illustrates the strengths of encoder-decoder models over their decoder-only counterparts. On edge devices, small encoder-decoder language models achieved 47% lower first-token latency and 4.7 times higher throughput. These gains stem from the encoder processing the input only once and from the clean separation of the understanding and generation phases. Decoder-only models, by contrast, keep attending over the input at every generation step, which drives up computational cost, especially for longer input sequences.
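The cost asymmetry described above can be sketched with a back-of-the-envelope calculation. The numbers below (a 1B-parameter budget, a hypothetical 2/3-1/3 encoder-decoder parameter split, the chosen sequence lengths) are illustrative assumptions, not figures from the paper; the sketch counts parameter reads per token and ignores attention's quadratic term.

```python
# Illustrative decoding-cost sketch: an encoder-decoder encodes the input
# once, after which each generated token only runs through the (smaller)
# decoder, while a decoder-only model pushes every token through all of
# its parameters. Constant FLOP factors are dropped.

def enc_dec_cost(total_params: float, dec_frac: float,
                 n_in: int, m_out: int) -> float:
    """Encode n_in tokens once, then decode m_out tokens with the decoder alone."""
    enc = total_params * (1 - dec_frac)
    dec = total_params * dec_frac
    return enc * n_in + dec * m_out

def dec_only_cost(total_params: float, n_in: int, m_out: int) -> float:
    """Prefill the prompt, then every new token runs through the full model."""
    return total_params * (n_in + m_out)

if __name__ == "__main__":
    P, n, m = 1e9, 1024, 128  # hypothetical: 1B params, long input, short output
    ed = enc_dec_cost(P, dec_frac=1 / 3, n_in=n, m_out=m)  # assumed 2/3-1/3 split
    do = dec_only_cost(P, n, m)
    print(f"encoder-decoder / decoder-only cost ratio: {ed / do:.2f}")
```

Under these toy assumptions the encoder-decoder spends roughly a third less compute on the same request; the advantage grows as the output gets shorter relative to the input, which matches the asymmetric workloads the article highlights.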
Knowledge Distillation: Learning from the Big Ones
An often-cited argument for decoder-only models is that they can learn from large, scalable teachers. Recent advances in knowledge distillation, however, now let encoder-decoder models benefit from the capabilities of large decoder-only teachers without giving up their architectural advantages. This approach yielded performance gains of up to 6 points across a range of tasks, particularly asymmetric sequence tasks, where the input and output distributions benefit from different processing approaches.
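At its core, this kind of distillation trains the student to match the teacher's output distribution. The following is a minimal sketch of the standard temperature-scaled distillation loss in NumPy; the temperature value and the softened-KL formulation are generic textbook choices, not details taken from the paper.

```python
# Minimal knowledge-distillation loss: KL divergence between the
# temperature-softened token distributions of a large teacher and a small
# student. Here the "models" are just logit arrays of shape (seq, vocab).

import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """Mean KL(teacher || student) over positions, with softened logits."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1)
    # scale by T^2 so gradient magnitudes stay comparable to a hard-label loss
    return float(kl.mean() * temperature ** 2)
```

In practice this loss is minimized alongside (or instead of) the usual cross-entropy on ground-truth tokens, letting the small encoder-decoder student inherit behavior from a much larger decoder-only teacher.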
Scalability and Performance
Studies of scaling behavior show that the performance advantage of encoder-decoder models grows with parameter count up to one billion. Between 330 million and one billion parameters, encoder-decoder models maintained a consistent 6-7% lead over decoder-only models. These results suggest that architectural choice matters more as the parameter budget shrinks, especially for on-device and edge deployments, where computational efficiency is crucial.
Transfer to Vision-Language Tasks
The advantages of the encoder-decoder architecture are not limited to text. Significant gains were also achieved on vision-language tasks, i.e., tasks that process both image and text data: improvements of 11.2% on VQAv2, 8.2% on TextVQA, and 7.3% on ChartQA, without sacrificing the efficiency advantages. Combining encoder-decoder models with modern components such as Rotary Positional Embeddings (RoPE) and vision encoders opens up new possibilities for deploying capable language models in resource-constrained environments.
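For readers unfamiliar with RoPE: it encodes position by rotating pairs of query/key feature dimensions by position-dependent angles, so that relative offsets fall out of the attention dot product. A minimal NumPy sketch of the standard formulation (the base frequency 10000 is the conventional default, not a choice specific to this work):

```python
# Minimal Rotary Positional Embedding (RoPE) sketch: each pair of feature
# dimensions (x1[k], x2[k]) is rotated by an angle that grows linearly with
# the token position, at a per-pair frequency. The rotation preserves vector
# norms, and dot products between rotated queries and keys depend only on
# the relative position offset.

import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequency
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because the transform is applied to queries and keys rather than added to the input, it composes cleanly with both the encoder's bidirectional attention and the decoder's cross-attention.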
The Supposed Bottleneck
The assumption that encoder-decoder models hit a performance bottleneck beyond a certain parameter count is challenged by these results. While T5 has already shown impressive results at 20 billion parameters, where the supposed bottleneck actually lies remains unclear. The constraint might even be an advantage, forcing models to learn more efficient representations instead of relying on brute-force scaling. Future research could probe the theoretical limits of the architecture, for example by integrating residual connections between encoder and decoder while preserving the efficiency benefits.
Conclusion
The results of these studies demonstrate that encoder-decoder models, particularly among small language models, are a powerful and efficient alternative to the dominant decoder-only models. Especially in resource-constrained environments, where efficiency is central, they offer decisive advantages. The renaissance of the encoder-decoder architecture underscores how much architectural choice matters in language model development and opens new avenues for deploying AI across a wide range of applications.
Bibliography:
https://arxiv.org/abs/2501.16273
https://arxiv.org/pdf/2501.16273?
https://paperreading.club/page?id=280261
https://x.com/kellerjordan0/status/1884163463733469293
https://aclanthology.org/volumes/2024.naacl-long/
https://github.com/azminewasi/Awesome-LLMs-ICLR-24
https://learnopencv.com/simsiam/
https://www.chatpaper.com/chatpaper/fr?id=3&date=1737993600&page=1
https://www.researchgate.net/publication/369855320_SLM_End-to-end_Feature_Selection_via_Sparse_Learnable_Masks
https://mediatum.ub.tum.de/doc/1700589/1700589.pdf