Diffusion Models Emerge as a New Approach to Large Language Models

Large language models (LLMs) have made enormous progress in recent years and are increasingly shaping how we interact with information. Autoregressive models (ARMs) have long been considered the standard for developing such LLMs. A new approach based on diffusion models is now challenging this dominance and opening up exciting perspectives for the future of language processing.

How Diffusion Models Work

In contrast to ARMs, which generate text token by token from left to right, diffusion models are based on a two-stage process. In the first phase, the so-called forward process, the training data is gradually corrupted until it eventually resembles pure noise; for discrete text, this corruption typically takes the form of masking tokens rather than adding continuous noise. The second phase, the reverse process, then learns to transform this noise back into coherent data step by step. It is driven by a neural network, typically a transformer, that predicts the masked tokens.
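
To make the forward process concrete, the following is a minimal sketch of token-level masking as it is used in discrete text diffusion. The function name, the mask ID, and the tensor shapes are illustrative assumptions, not taken from any specific implementation.

```python
import torch

def forward_mask(x0: torch.Tensor, t: float, mask_id: int) -> torch.Tensor:
    """Forward process for discrete text diffusion: each token of the
    clean sequence x0 is independently replaced by a [MASK] token with
    probability t. At t=0 the data is untouched; at t=1 it is "pure
    noise" (every token masked)."""
    mask = torch.rand_like(x0, dtype=torch.float) < t
    return torch.where(mask, torch.full_like(x0, mask_id), x0)

# Example: at t=0.5, roughly half of the tokens end up masked.
x0 = torch.tensor([[12, 7, 99, 3, 41, 8]])
xt = forward_mask(x0, t=0.5, mask_id=0)
```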

Because they are trained by optimizing a likelihood bound, diffusion models offer a principled, probabilistic approach to generative modeling. They learn to approximate the probability distribution of texts as a whole rather than one next-token distribution at a time, which can lead to more diverse and creative outputs.
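
As an illustration, the following hedged sketch shows a Monte Carlo estimate of such a bound for masked text diffusion: sample a noise level t, mask tokens with probability t, and score the model's predictions only at the masked positions, reweighted by 1/t. The `model` interface and the normalization are assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F

def diffusion_lm_loss(model, x0, mask_id, vocab_size):
    """One training step of a masked-diffusion objective: a Monte Carlo
    estimate of the (negative) likelihood bound for a batch x0 of shape
    (batch, seq_len)."""
    b, n = x0.shape
    t = torch.rand(b, 1).clamp(min=1e-3)   # noise level in (0, 1] per sequence
    masked = torch.rand(b, n) < t          # forward process: which tokens to mask
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)
    logits = model(xt)                     # assumed output shape: (b, n, vocab_size)
    ce = F.cross_entropy(
        logits.view(-1, vocab_size), x0.view(-1), reduction="none"
    ).view(b, n)
    # Score only the masked positions; the 1/t factor reweights across
    # noise levels so the estimate remains a valid likelihood bound.
    return (ce * masked / t).sum() / masked.sum().clamp(min=1)
```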

LLaDA: An Example of a Large Language Diffusion Model

A promising example of this new approach is LLaDA, a diffusion model trained from scratch following the established paradigm of pre-training followed by supervised fine-tuning (SFT). LLaDA shows strong scalability across a range of benchmarks and outperforms comparably trained ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs such as LLaMA 3 8B in in-context learning and, after SFT, demonstrates impressive instruction-following abilities in case studies such as multi-turn dialogue.

Furthermore, LLaDA addresses the so-called "reversal curse," the phenomenon that models trained on statements in one direction (e.g., "A is B") struggle to generalize to the reverse direction ("B is A"). In a test on completing reversed poems, LLaDA even outperforms GPT-4o.

Potentials and Challenges

The results of research on diffusion models like LLaDA are promising and suggest that these models are a viable alternative to ARMs. They challenge the assumption that the key capabilities of LLMs, such as in-context learning and instruction following, are inextricably tied to autoregressive generation.

Despite the promising results, diffusion models for language are still at an early stage of development. Further research is needed to realize their full potential and to overcome the challenges associated with their use. These include, among others, the high computational cost of training and the design of efficient sampling strategies: generation proceeds through many iterative denoising steps rather than a single left-to-right pass, so the number and scheduling of those steps directly affect both quality and cost.
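
To illustrate why sampling strategy is a research question of its own, here is a hedged sketch of one common scheme for masked text diffusion, often called low-confidence remasking: start from a fully masked sequence, predict all tokens in parallel, commit the most confident predictions, and re-mask the rest for the next step. The unmasking schedule and the `model` interface are assumptions, not a specific published implementation.

```python
import torch

@torch.no_grad()
def diffusion_sample(model, seq_len, steps, mask_id):
    """Reverse process via low-confidence remasking: repeatedly predict
    every masked token in parallel, keep the most confident predictions,
    and re-mask the remainder. The re-masked fraction shrinks each step,
    so the sequence is fully committed after `steps` iterations."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        masked_now = x == mask_id
        logits = model(x)                        # assumed shape: (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence and argmax
        x = torch.where(masked_now, pred, x)     # tentatively fill all masked slots
        n_remask = int(seq_len * (1 - (step + 1) / steps))
        if n_remask > 0:
            # Protect already-committed tokens by giving them confidence 1,
            # then re-mask the least confident freshly filled positions.
            cand = torch.where(masked_now, conf, torch.ones_like(conf))
            idx = cand[0].topk(n_remask, largest=False).indices
            x[0, idx] = mask_id
    return x
```

Because every call to the network predicts many tokens at once, the number of denoising steps trades generation quality against compute, which is exactly the trade-off that work on sampling strategies tries to optimize.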

Conclusion

The development of large language diffusion models like LLaDA represents a significant advance in research on large language models. These models offer an alternative to the established ARM architecture and open up new possibilities for generating text and modeling language. Although further research is needed, the results so far demonstrate the considerable potential of this technology for the future development of AI-based language models. Developments in this area will also expand what AI partners like Mindverse can offer, with customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems.

Bibliography:

- Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li. "Large Language Diffusion Models." https://huggingface.co/papers/2502.09992
- "High-Resolution Image Synthesis with Latent Diffusion Models." https://arxiv.org/abs/2112.10752
- "Diffusion Models Beat GANs on Image Synthesis." https://arxiv.org/abs/2105.05233
- "Large Language Models and Publicly Available Data." https://www.vde.com/resource/blob/2361636/bc0c7b6d8464dc8e285618b35b11caa7/paper---large-language-models-data.pdf
- "ELLA: Exploring Large Language Model Abilities through Architectural Innovation." https://ella-diffusion.github.io/
- "Training Compute-Optimal Large Language Models." https://huggingface.co/papers/2406.11831
- "Awesome Diffusion Models." https://github.com/diff-usion/Awesome-Diffusion-Models
- "Imagen Video: High Definition Video Generation with Diffusion Models." https://proceedings.neurips.cc/paper_files/paper/2023/file/fdba5e0a9b57fce03e89cc0cad0a24e9-Paper-Conference.pdf
- "Scaling Laws for Deep Learning based Image Reconstruction." https://openreview.net/forum?id=ks5lAv8QDn
- "Scaling Vision with Sparse Mixture of Experts." https://dl.acm.org/doi/10.1145/3607540.3617144