Expert Race: Scaling Diffusion Transformers with a Novel Mixture of Experts Approach

Diffusion Transformer with Mixture of Experts: A New Approach to Scaling through "Expert Race"
Diffusion models have established themselves as the leading method in visual generation. Mixture of Experts (MoE) architectures offer a way to scale these models further while improving performance. A promising approach in this area is "Race-DiT," a novel MoE model for Diffusion Transformers that uses a flexible routing strategy called "Expert Race."
Traditional MoE models assign each token to specific experts through a learned routing function, typically selecting a fixed number of experts per token. "Expert Race," by contrast, takes a more dynamic approach: tokens and experts compete jointly in a "race," and the best-scoring candidates are selected. The model thus learns to assign experts to the most relevant tokens and to use compute more efficiently, concentrating expertise on the critical regions of the image and enabling more detailed and accurate generation.
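The core idea can be illustrated with a minimal routing sketch. The Python snippet below assumes a simple global top-k selection over all token-expert scores as the "race" described above; the function name expert_race_route and its parameters are illustrative, and the actual Race-DiT implementation may differ in details such as score normalization or capacity limits.

```python
import torch

def expert_race_route(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Select the k best token-expert pairs globally.

    scores: (num_tokens, num_experts) router affinities.
    Returns a boolean dispatch mask of the same shape.

    Unlike per-token top-k routing, every (token, expert) pair competes
    in one global "race", so compute is concentrated on the tokens the
    router deems most important.
    """
    num_tokens, num_experts = scores.shape
    flat = scores.flatten()                       # (num_tokens * num_experts,)
    winners = torch.topk(flat, k=k).indices       # winners of the race
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[winners] = True
    return mask.view(num_tokens, num_experts)

# Example: 8 tokens, 4 experts, keep the 8 best pairs overall.
scores = torch.randn(8, 4).softmax(dim=-1)
dispatch = expert_race_route(scores, k=8)
print(dispatch.sum(dim=1))  # tokens may receive 0, 1, or several experts
```

Because selection happens over all pairs at once, individual tokens can end up with zero, one, or several experts, which is what allows the router to concentrate compute on the most important tokens.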
However, developing MoE models for Diffusion Transformers also presents challenges, particularly in the shallower layers of the network, where useful expert assignments are harder to learn. To counteract this, the developers of Race-DiT introduce a layer-wise regularization. This regularization promotes learning in the early layers and contributes to a more stable and more capable overall model.
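The exact form of this regularization is not spelled out here; the sketch below only illustrates the general pattern of attaching a per-layer auxiliary term to the training objective so that shallow layers receive a direct signal. The names combined_loss, per_layer_router_losses, and aux_weight are hypothetical.

```python
import torch

def combined_loss(diffusion_loss: torch.Tensor,
                  per_layer_router_losses: list,
                  aux_weight: float = 0.01) -> torch.Tensor:
    # Each transformer layer contributes its own auxiliary routing loss,
    # so shallow layers get a direct training signal instead of relying
    # only on gradients propagated back from the final output.
    aux = torch.stack(per_layer_router_losses).mean()
    return diffusion_loss + aux_weight * aux
```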
Another known problem with MoE models is so-called "mode collapse," in which the router sends most tokens to a few overly specialized experts while the remaining experts stay largely unused. To prevent this, Race-DiT introduces a "router similarity loss." This loss term promotes a more even distribution of work among the experts and ensures that the full potential of the MoE approach is exploited.
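The precise definition of the router similarity loss is not given here, so the following is one plausible formulation consistent with the description above: penalize overlap between the experts' routing patterns so that several experts do not collapse onto the same tokens. The function name router_similarity_loss and the weight in the usage comment are assumptions, not the paper's exact loss.

```python
import torch

def router_similarity_loss(probs: torch.Tensor) -> torch.Tensor:
    """One plausible diversity penalty for MoE routing.

    probs: (num_tokens, num_experts) routing probabilities.
    Normalizing each expert's column and penalizing the off-diagonal
    entries of the expert-by-expert Gram matrix discourages multiple
    experts from specializing on the same tokens.
    """
    cols = probs / (probs.norm(dim=0, keepdim=True) + 1e-8)  # unit columns
    gram = cols.T @ cols                                     # (E, E) similarities
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.pow(2).mean()

# Usage (illustrative): add to the main training objective with a small weight.
# loss = diffusion_loss + 0.01 * router_similarity_loss(router_probs)
```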
Initial experiments on the ImageNet dataset are promising: Race-DiT achieves significant performance improvements over existing models while demonstrating good scaling properties. This suggests that the approach could be an important step toward developing even more powerful and efficient diffusion models for visual generation.
The combination of Diffusion Transformers with MoE and the innovative "Expert Race" strategy opens up new possibilities for scaling generative models. The targeted assignment of experts to the most important image areas allows for more efficient use of computing power and leads to improved image quality. Future research will show to what extent this approach can further revolutionize the field of visual generation.
Potential and Future Research
Research on MoE-based Diffusion Transformers is still in its early stages, but the results so far are promising. The flexible routing strategy "Expert Race" offers considerable potential for scaling these models and could pave the way for more complex and detailed image generation. Future research could focus on optimizing the routing strategy, improving layer-wise regularization, and investigating further application areas.
Significance for Companies like Mindverse
For companies like Mindverse, which specialize in AI-powered content creation, these developments are of great importance. More efficient and scalable diffusion models can significantly improve the quality and diversity of generated content and open up new possibilities for creative applications. Integrating technologies like Race-DiT into Mindverse's product range could help further strengthen the company's position as a leading provider of AI solutions in the field of content creation.
Bibliography:
Yuan, Y., Wang, Z., Huang, Z., Zhu, D., Zhou, X., Yu, J., & Min, Q. (2025). Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts. arXiv preprint arXiv:2503.16057.
Du, Y., Li, T., & Zhou, D. (2024). DiT-MoE: Mixture-of-Experts for Diffusion Transformers. arXiv preprint arXiv:2412.12953.
Geffner, T., & Sutton, R. S. (2024). A Formal Theory of Goal-Directed Learning. arXiv preprint arXiv:2410.02098v5.
Laskin, M., Wang, K., & Courville, A. (2023). Efficient Diffusion Transformer Policies with Mixture-of-Expert Denoisers for Multitask Learning. OpenReview.
feizc/DiT-MoE. (n.d.). GitHub. Retrieved from https://github.com/feizc/DiT-MoE
Efficient Diffusion Transformer Policies with Mixture-of-Expert Denoisers for Multitask Learning. (n.d.). The Moonlight. Retrieved from https://www.themoonlight.io/fr/review/efficient-diffusion-transformer-policies-with-mixture-of-expert-denoisers-for-multitask-learning
mathfinder/arxiv:2503.16057. (n.d.). Hugging Face. Retrieved from https://huggingface.co/papers/2412.12953
ICML 2024 - Proceedings of the 41st International Conference on Machine Learning. (n.d.). Retrieved from https://icml.cc/virtual/2024/papers.html
wangkai930418/awesome-diffusion-categorized. (n.d.). GitHub. Retrieved from https://github.com/wangkai930418/awesome-diffusion-categorized
Nunes, N. J., et al. (2023). Artificial intelligence in drug discovery and development. Drug Discovery Today, 28(6), 1001–1006.