Guidance-Free Training: Efficient Image Generation without Guidance

Visual Generation Reimagined: Moving Away from Guidance, Towards Efficiency

AI-powered image generation has made tremendous strides in recent years. A common technique, Classifier-Free Guidance (CFG), runs both a conditional and an unconditional model at every sampling step. While this enables precise control over image generation, it also doubles the computational cost of inference. A new approach, Guidance-Free Training (GFT), promises comparable performance at half the computational cost.

The Problem with Guidance

CFG has established itself as a standard technique across visual generative models. However, it requires evaluating two separate models at each sampling step: a conditional model that responds to specific inputs, and an unconditional model that generates freely. Combining the outputs of both models steers the generation process. This doubled computation, however, places a significant burden on resources and slows down generation.
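The combination step above is a simple linear blend of the two model outputs. A minimal sketch, with toy linear functions standing in for the real neural denoisers (the stand-in models are hypothetical; only the blending rule reflects standard CFG):

```python
import numpy as np

# Toy stand-ins for the two networks CFG evaluates at every sampling step.
# A real denoiser would be a neural network; these linear maps are
# hypothetical placeholders that just make the blend concrete.
def eps_cond(x, c):
    return 0.9 * x + 0.1 * c   # conditional noise prediction

def eps_uncond(x):
    return 0.9 * x             # unconditional noise prediction

def cfg_prediction(x, c, w=3.0):
    """Classifier-free guidance: two forward passes, then a linear blend.

    w = 1 recovers the plain conditional model; w > 1 pushes the
    prediction further toward the condition c.
    """
    e_c = eps_cond(x, c)       # forward pass 1
    e_u = eps_uncond(x)        # forward pass 2
    return e_u + w * (e_c - e_u)
```

The two forward passes per step are exactly the doubled cost the article describes: the guidance weight `w` cannot be applied without the second, unconditional evaluation.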

Guidance-Free Training: A New Path

GFT sidesteps this problem by dispensing with two-model guidance altogether. Instead, GFT trains a single model that is directly optimized for the desired sampling behavior. This significantly reduces the computational cost at inference time. Unlike previous distillation-based approaches, which rely on pre-trained CFG networks, GFT can be trained directly from scratch, eliminating the need to first train a full CFG model before simplifying it.
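The practical payoff shows up in the sampling loop: where CFG evaluates two networks per step, a guidance-free model is queried once. A rough sketch under assumed simplifications (the toy model and the update rule are hypothetical, not a real sampler):

```python
import numpy as np

def sample(model, c, steps=50, dim=4, seed=0):
    """Toy denoising loop: exactly one model evaluation per step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    evals = 0
    for _ in range(steps):
        pred = model(x, c)   # single forward pass, no unconditional branch
        x = x - 0.1 * pred   # hypothetical update rule, not a real ODE solver
        evals += 1
    return x, evals
```

With `steps` denoising steps, a guidance-free model costs `steps` network evaluations; a CFG sampler of the same length would cost `2 * steps`.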

Simple Implementation, Big Impact

The implementation of GFT is surprisingly simple. The algorithm retains the same maximum likelihood objective as CFG and differs primarily in how the conditional model is parameterized. Developers can implement GFT with minimal changes to existing codebases, since most design choices and hyperparameters carry over directly from CFG. This greatly reduces the effort of integrating GFT into existing projects.
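One plausible reading of that re-parameterization is sketched below: keep a CFG codebase's denoising MSE loss and its condition dropout, but express the conditional prediction fed to the loss as a linear blend of the deployed sampling model and an unconditional branch. All names, the blend coefficient `beta`, and the toy linear "network" are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def net(x, c, drop_cond):
    # Toy linear stand-in for the network; drop_cond=True mimics the
    # condition dropout already present in CFG training pipelines.
    return 0.9 * x + (0.0 if drop_cond else 0.1) * c

def gft_style_loss(x0, c, beta=0.5):
    """Same MSE denoising objective as CFG training; only the prediction
    fed to the loss is re-parameterized (hypothetical sketch)."""
    noise = rng.standard_normal(x0.shape)
    xt = 0.7 * x0 + 0.7 * noise                # toy forward diffusion step
    pred_sample = net(xt, c, drop_cond=False)  # the model used at inference
    pred_uncond = net(xt, c, drop_cond=True)   # unconditional branch
    pred_cond = beta * pred_sample + (1.0 - beta) * pred_uncond
    return float(np.mean((pred_cond - noise) ** 2))
```

Because the loss function itself is unchanged, such a scheme would indeed slot into an existing CFG training loop with only the prediction line modified, which matches the article's claim of minimal integration effort.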

Versatile Application and Convincing Results

Extensive experiments with five distinct visual models demonstrate the effectiveness and versatility of GFT. Across diffusion, autoregressive, and masked-prediction models, GFT consistently achieves comparable or even lower FID (Fréchet Inception Distance) scores than CFG baselines, with similar diversity/fidelity trade-offs. And all of this without relying on guidance at sampling time.

Outlook

GFT represents a promising approach for efficient image generation. The simple implementation and compelling results make GFT an attractive alternative to CFG, especially in resource-constrained environments. Future research could focus on further optimizing GFT and applying it to other areas of image generation. The ability to train directly from scratch also opens up new possibilities for the development of specialized models tailored to specific requirements.
