InstantCharacter: A New Framework for Personalized Character Image Generation

The personalized generation of images, especially of characters, has advanced considerably in recent years thanks to AI-based methods. However, previous methods, mostly built on U-Net architectures, reach their limits in generalizability and image quality, while optimization-based methods require fine-tuning for each individual subject, which impairs the textual controllability of image generation. A new framework called InstantCharacter now promises to overcome these challenges.

InstantCharacter, developed by a team at Tencent and InstantX, is built on a scalable diffusion transformer model. The framework offers three key advantages. First, it personalizes characters in appearance, pose, and style across different domains while maintaining high image quality. Second, it introduces a scalable adapter built from stacked transformer encoders; this adapter processes character features from open domains and interacts seamlessly with the latent space of modern diffusion transformers. Third, the team curated a training dataset of over 10 million examples, consisting of a paired subset (multi-view images of the same character) and an unpaired subset (generic text-image combinations). This dual data structure allows identity consistency and textual editability to be optimized simultaneously through separate learning paths.
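To make the adapter idea concrete, the following PyTorch sketch shows how a stack of transformer encoder layers could refine character image features into conditioning tokens. The class name, dimensions, and layer counts are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class CharacterAdapter(nn.Module):
    """Illustrative adapter: a stack of transformer encoder layers that
    refines character image features into conditioning tokens for the
    diffusion transformer. All sizes here are assumptions."""

    def __init__(self, feat_dim: int = 1024, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads,
            dim_feedforward=4 * feat_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Project the refined tokens into the width of the DiT latent stream.
        self.to_latent = nn.Linear(feat_dim, feat_dim)

    def forward(self, char_features: torch.Tensor) -> torch.Tensor:
        # char_features: (batch, tokens, feat_dim), e.g. from an image encoder.
        refined = self.encoder(char_features)
        return self.to_latent(refined)
```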

The use of a diffusion transformer as the backbone is an important departure from previous approaches. While U-Net architectures remain widespread in image generation, transformer models scale better and capture finer detail thanks to their ability to model long-range dependencies. The scalable adapter lets InstantCharacter adapt to a wide variety of characters without retraining for each individual character, which significantly improves the efficiency and flexibility of the system.
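One common way to condition a frozen backbone on such adapter tokens is an additional, zero-initialized cross-attention branch. The sketch below illustrates this pattern in PyTorch; whether InstantCharacter uses exactly this mechanism is an assumption here:

```python
import torch
import torch.nn as nn

class AdapterCrossAttention(nn.Module):
    """Hypothetical conditioning branch: hidden states of a frozen DiT
    block attend to the adapter's character tokens, so new characters
    require no backbone retraining."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate so the frozen backbone's behavior is
        # unchanged at the start of adapter training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, adapter_tokens: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, dim) from a DiT block;
        # adapter_tokens: (batch, tokens, dim) from the character adapter.
        attn_out, _ = self.attn(self.norm(hidden_states), adapter_tokens, adapter_tokens)
        return hidden_states + self.gate * attn_out
```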

The extensive training dataset also plays a crucial role in the framework's performance. The combination of paired and unpaired data lets the model learn both the consistency of a character's appearance across different views (from the paired subset) and the ability to generate and modify images from textual input (from the unpaired subset).
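A minimal sketch of such a dual-path training step might alternate between the two subsets. Here, image_encoder, dit.denoising_loss, and the batch keys are hypothetical helpers for illustration; the actual InstantCharacter recipe may differ:

```python
import torch

def dual_path_step(batch: dict, image_encoder, adapter, dit, step: int) -> torch.Tensor:
    """Sketch of the dual learning paths; all helpers and keys are assumptions."""
    if step % 2 == 0:
        # Paired path: condition on one view of a character and denoise
        # another view of the same character to enforce identity consistency.
        ref, target, prompt = batch["view_a"], batch["view_b"], batch["prompt"]
    else:
        # Unpaired path: generic text-image data keeps the model responsive
        # to prompts instead of overfitting to a fixed identity.
        ref, target, prompt = batch["image"], batch["image"], batch["caption"]
    cond_tokens = adapter(image_encoder(ref))                # character conditioning tokens
    return dit.denoising_loss(target, prompt, cond_tokens)  # standard diffusion loss
```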

Qualitative experiments have shown that InstantCharacter can generate high-quality images that are both consistent in character identity and controllable by text input. This sets a new standard for character-based image generation and opens up new possibilities for applications in areas such as game development, virtual reality, and digital art.

The developers of InstantCharacter have made the project's source code publicly available to encourage further research and development in this area. The release of InstantCharacter thus marks an important step toward more flexible and efficient personalized image generation.
