Improved Training for Latent Consistency Models Boosts Generative AI Performance

Improved Training for Latent Consistency Models
Consistency models have emerged as a promising alternative to diffusion models in generative AI. They enable the generation of high-quality images, text, and videos in either a single step or a few steps. While consistency models trained in pixel space already achieve results comparable to diffusion models, scaling their training to large datasets, especially for text-to-image and video generation, remains challenging in latent space.
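To make the one-step versus multi-step distinction concrete, here is a minimal sampling sketch in PyTorch following the standard consistency-model sampling procedure. The function consistency_fn stands in for a trained model that maps a noisy sample at noise level sigma to a clean estimate, and the sigma values are illustrative assumptions, not values from the paper.

```python
import torch

@torch.no_grad()
def sample_consistency(consistency_fn, shape, sigmas, device="cpu"):
    # One-step generation: map pure noise at the highest noise level
    # directly to a clean sample estimate.
    x = torch.randn(shape, device=device) * sigmas[0]
    x0 = consistency_fn(x, sigmas[0])
    # Optional multistep refinement: re-noise the estimate to a lower
    # level and denoise again. (The original algorithm re-noises to
    # sqrt(sigma^2 - sigma_min^2); that detail is omitted for brevity.)
    for sigma in sigmas[1:]:
        x = x0 + torch.randn_like(x0) * sigma
        x0 = consistency_fn(x, sigma)
    return x0

# Usage with a dummy stand-in model: sigmas = [80.0] gives one-step
# sampling, [80.0, 20.0] gives two-step sampling.
img = sample_consistency(lambda x, s: x / (1 + s), (1, 3, 64, 64), [80.0, 20.0])
```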
Recent research investigates the differences between pixel and latent space and identifies impulsive outliers in the latent data as the primary cause of performance degradation when consistency models are trained there. These rare, large-magnitude values destabilize training and reduce its efficiency. To address this problem, the researchers propose several optimizations.
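To see what "impulsive outliers" means in practice, one can compare the tail behavior of pixel values and latent values, for example via excess kurtosis, which is near zero for Gaussian data and grows with heavy tails. The sketch below uses synthetic stand-ins for real VAE latents, so the numbers are illustrative only.

```python
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    # Excess kurtosis: ~0 for Gaussian data, large for heavy-tailed,
    # outlier-prone distributions.
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean().item() - 3.0

# Synthetic stand-ins: pixel-like values are roughly Gaussian, while
# latent-like values are contaminated by rare, large-magnitude spikes.
pixels = torch.randn(1_000_000)
latents = torch.randn(1_000_000)
mask = torch.rand(1_000_000) < 1e-4  # rare impulsive outliers
latents[mask] += 50.0 * torch.randn(int(mask.sum()))

print(f"pixel excess kurtosis:  {excess_kurtosis(pixels):.2f}")   # ~0
print(f"latent excess kurtosis: {excess_kurtosis(latents):.2f}")  # large
```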
Optimizations of the Training Process
A central aspect of the proposed improvements is the replacement of the Pseudo-Huber loss function with the Cauchy loss function. The Cauchy loss is more robust to outliers and bounds their influence on the training signal. Additionally, a diffusion loss is applied during the early training steps to accelerate the model's convergence. Finally, Optimal Transport (OT) coupling improves how noise samples are paired with latent representations within each mini-batch, which further contributes to the performance improvement.
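For illustration, here is a hedged sketch of the two loss functions in PyTorch. The constant c and the exact weighting used in the paper may differ, but the qualitative point holds: the Pseudo-Huber loss grows linearly in the residual, while the Cauchy loss grows only logarithmically, so a single impulsive outlier contributes a bounded gradient.

```python
import torch

def pseudo_huber_loss(pred, target, c=0.03):
    # Quadratic near zero, linear for large residuals.
    r2 = (pred - target).pow(2).flatten(1).sum(dim=1)
    return (torch.sqrt(r2 + c**2) - c).mean()

def cauchy_loss(pred, target, c=0.03):
    # Logarithmic growth: the gradient magnitude is bounded, so
    # impulsive outliers cannot dominate a training step.
    r2 = (pred - target).pow(2).flatten(1).sum(dim=1)
    return torch.log1p(r2 / c**2).mean()
```

Mini-batch OT coupling can be sketched similarly; using SciPy's assignment solver here is an illustrative assumption rather than the paper's actual implementation.

```python
from scipy.optimize import linear_sum_assignment

def ot_pair(noise, latents):
    # Re-order the noise batch so the pairs (noise[i], latents[i])
    # minimize total squared distance within the mini-batch.
    cost = torch.cdist(noise.flatten(1), latents.flatten(1)).pow(2)
    _, col = linear_sum_assignment(cost.cpu().numpy())
    # noise[i] was assigned to latents[col[i]]; invert the permutation.
    return noise[torch.as_tensor(col).argsort()]
```

The early-phase diffusion loss mentioned above would be blended into the overall objective during the first iterations; it is omitted here for brevity.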
Another important contribution is the adaptive scaling-c scheduler. Rather than keeping the scale parameter c of the robust loss fixed, this scheduler adjusts it dynamically over the course of training to match the characteristics of the data. The integration of Non-scaling LayerNorm into the model's architecture captures the statistical properties of the features more faithfully and likewise reduces the influence of outliers.
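Both ideas can be sketched as follows. The exponential decay law for c is a hypothetical stand-in (the paper derives its own adaptive schedule), and Non-scaling LayerNorm is read here as standard LayerNorm with the learnable scale removed, so that a learned per-channel gain cannot re-amplify outlier features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaling_c_schedule(step, total_steps, c_start=1.0, c_end=0.01):
    # Hypothetical schedule: start tolerant of large residuals, then
    # tighten c as training stabilizes. Exponential interpolation is
    # an assumption, not the paper's exact rule.
    t = min(step / max(total_steps, 1), 1.0)
    return c_start * (c_end / c_start) ** t

class NonScalingLayerNorm(nn.Module):
    # LayerNorm without the learnable scale (gamma): features are
    # normalized to unit variance and only a bias is learned.
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.dim, self.eps = (dim,), eps
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return F.layer_norm(x, self.dim, weight=None,
                            bias=self.bias, eps=self.eps)

# Usage: the scheduled c feeds the robust loss above, e.g.
# cauchy_loss(pred, target, c=scaling_c_schedule(step, 400_000)).
```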
Results and Outlook
By combining these strategies, the researchers were able to train latent consistency models that deliver high-quality results in just one or two sampling steps, significantly narrowing the performance gap between latent consistency models and diffusion models. These advancements open up new possibilities for applying consistency models in areas such as image synthesis, text-to-image generation, and video editing. Future research could focus on further improving the robustness and efficiency of the training to ensure scalability to even larger datasets.
The implementation of the described improvements is publicly available, giving researchers and developers the opportunity to build on the optimized training method and run their own experiments. The continued development of consistency models promises exciting innovations in generative AI and could lead to new applications across many domains.
Bibliography:
- Dao, Q., Doan, K., Liu, D., Le, T., & Metaxas, D. (n.d.). Improved Training Technique for Latent Consistency Models.
- Kong, Z., Zhou, W., Xiang, S., & Zhang, C. (2024). ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10863–10872).
- G-U-N. (n.d.). Awesome-Consistency-Models. Retrieved from https://github.com/G-U-N/Awesome-Consistency-Models
- openvinotoolkit. (n.d.). latent-consistency-models-image-generation. Retrieved from https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/latent-consistency-models-image-generation/latent-consistency-models-image-generation.ipynb