Aligning Representations: Using Autoregressive Language Models for Image Generation

The generation of images from text descriptions is a fascinating field of Artificial Intelligence that has made enormous progress in recent years. A promising approach uses autoregressive large language models (LLMs), which have primarily been used for text generation. These models generate text sequentially, token by token, conditioned on the preceding tokens. The challenge lies in transferring this capability to image generation and thereby producing coherent and detailed images.
A new research approach, known as Autoregressive Representation Alignment (ARRA), promises to significantly expand the possibilities of autoregressive LLMs for image generation. In contrast to previous approaches, which often required complex changes to the architecture of the models, ARRA takes a different path. The core of ARRA lies in aligning the internal representations of the language model with visual representations from external, pre-trained image models. This alignment is achieved through a global visual alignment loss function that maximizes the similarity between the representations.
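The global visual alignment loss described above can be sketched in a few lines. The function name, the tensor shapes, and the use of a single linear projection with cosine similarity are illustrative assumptions for this sketch, not the paper's exact formulation: the idea is simply to project the LLM's hidden states into the space of a frozen, pre-trained visual encoder and penalize dissimilarity.

```python
import numpy as np

def global_visual_alignment_loss(llm_hidden, visual_features, W):
    """Illustrative alignment loss: project LLM hidden states into the
    visual feature space and penalize dissimilarity to the features of a
    frozen, pre-trained image encoder.

    llm_hidden:      (batch, d_model)    hidden states of the language model
    visual_features: (batch, d_visual)   features from the external image model
    W:               (d_model, d_visual) learned projection matrix (assumed)
    """
    projected = llm_hidden @ W
    # Cosine similarity per example; maximizing similarity is the same as
    # minimizing (1 - cosine similarity).
    num = np.sum(projected * visual_features, axis=-1)
    den = (np.linalg.norm(projected, axis=-1)
           * np.linalg.norm(visual_features, axis=-1) + 1e-8)
    cos = num / den
    return float(np.mean(1.0 - cos))
```

In an actual training loop, this term would be added to the usual next-token prediction loss, so the external visual model guides training without altering the LLM's architecture.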
Another important element of ARRA is the introduction of a special token: during training, the model's hidden state at this token is aligned with the representation produced by the external visual encoder, anchoring the alignment within the ordinary autoregressive sequence.
Extensive experiments demonstrate the versatility and effectiveness of ARRA. Both when training LLMs originally built for pure text generation and when training models from random initialization, ARRA significantly improved the quality of the generated images. Measured by the Fréchet Inception Distance (FID), a common quality metric for generated images in which lower values indicate higher fidelity, improvements of up to 25.5% were achieved on various datasets. Particularly noteworthy is that these improvements were achieved without changes to the architecture of the LLMs used, which underscores the easy applicability of ARRA.
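For reference, FID compares the Gaussian statistics (mean and covariance) of Inception features extracted from real and generated images. The core formula can be sketched as follows; feature extraction is omitted here, and the eigenvalue-based trace computation is one common implementation choice, not the only one:

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    For FID, mu and sigma are the mean and covariance of Inception features
    computed over real and generated image sets (feature extraction omitted).
    """
    diff = mu1 - mu2
    # Tr((S1 @ S2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of S1 @ S2, which are real and non-negative for PSD
    # covariance matrices; clip tiny negatives from numerical noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    covmean_trace = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * covmean_trace)
```

Identical feature distributions yield a distance of 0; the larger the gap between real and generated statistics, the higher the score, which is why the reported percentage improvements correspond to FID reductions.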
Furthermore, ARRA is also suitable for domain adaptation, i.e., the adaptation of general LLMs to specific application areas. For example, by aligning a general LLM to a specialized medical image model, a significant improvement in image generation in the medical context could be achieved. These results suggest that ARRA is a promising approach to unlock the power of autoregressive LLMs for image generation in various applications.
ARRA shows that not only architectural innovations but also the redesign of training objectives are crucial for solving challenges in multimodal generation. This approach offers a complementary perspective for the further development of autoregressive models and opens up new possibilities for the generation of images from text descriptions.
For Mindverse, a German provider of AI-powered content solutions, this development is of particular interest. The integration of ARRA into the platform could expand the possibilities of image generation and provide users with new creative tools. From the creation of marketing materials to the development of personalized content, the combination of text and image offers enormous potential. The further development of technologies like ARRA contributes to unlocking this potential and pushing the boundaries of what is possible in AI-powered content creation.