AI Model Enhances Identity Preservation in Text-to-Video Generation

Identity-Preserving Video Generation with Enhanced Face Knowledge

Generating videos from text descriptions is a rapidly growing field in artificial intelligence. A particularly exciting direction is identity-preserving text-to-video generation (IPT2V), in which the appearance of a specific person is maintained throughout the generated video. Tuning-free approaches, which adapt large pre-trained video diffusion models without per-identity fine-tuning, are gaining popularity due to their efficiency and scalability. Generating convincing facial dynamics while preserving identity, however, remains challenging.

A new approach called FantasyID promises progress in this area. The framework builds on diffusion transformers (DiT) and enhances the face knowledge of the pre-trained video model. A core component is the integration of 3D face geometry priors, which ensure plausible facial structures during video synthesis. To prevent the model from learning a simple "copy-paste" shortcut that merely replicates the reference face across all frames, a multi-view face augmentation strategy is employed: it exposes the model to varied 2D facial appearances, encouraging more dynamic facial expressions and head poses.
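To make this pipeline concrete, here is a minimal PyTorch sketch of how 2D appearance features and 3D geometry features might be fused into a single set of conditioning tokens. The module name, dimensions, and the attention-based fusion design are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FaceConditionFuser(nn.Module):
    """Hypothetical fusion module: combines 2D appearance features
    (e.g. from a face recognition backbone) with 3D geometry features
    (e.g. from a 3D face reconstruction model) into a fixed-length
    set of condition tokens."""

    def __init__(self, app_dim=512, geo_dim=256, token_dim=1024, num_tokens=16):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, token_dim)
        self.geo_proj = nn.Linear(geo_dim, token_dim)
        # Learnable queries attend over the projected 2D and 3D features
        # and emit the fused conditioning tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, app_feats, geo_feats):
        # app_feats: (B, N_app, app_dim), 2D features from the reference view(s)
        # geo_feats: (B, N_geo, geo_dim), 3D geometry prior features
        kv = torch.cat([self.app_proj(app_feats), self.geo_proj(geo_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)  # (B, num_tokens, token_dim)
        return fused
```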

The fusion of the 2D and 3D features serves as the control signal for the generation process. Rather than feeding this control information directly into all DiT layers via cross-attention, FantasyID uses a learnable, layer-specific adaptation mechanism that selectively injects the fused features into individual DiT layers, balancing identity preservation against motion dynamics.
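A minimal sketch of what such layer-specific injection could look like: each DiT layer gets its own cross-attention over the fused face tokens, scaled by a learnable per-layer gate. The class name, the zero-initialized tanh gate, and the dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFaceInjection(nn.Module):
    """Hypothetical per-layer adapter: cross-attends the video tokens
    of one DiT layer over the fused face condition tokens."""

    def __init__(self, hidden_dim=1024, cond_dim=1024, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        # One scalar gate per layer; zero-init leaves the pre-trained
        # model unchanged at the start of training and lets each layer
        # learn how much identity conditioning it should receive.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, face_tokens):
        # hidden: (B, T, hidden_dim), video tokens of one DiT layer
        # face_tokens: (B, N, cond_dim), fused 2D+3D condition tokens
        attn_out, _ = self.cross_attn(hidden, face_tokens, face_tokens)
        return hidden + torch.tanh(self.gate) * attn_out
```

The zero-initialized gate is a common adapter trick: it guarantees the adapted model starts out identical to the base model, so training only gradually introduces the identity signal at whichever depths benefit from it.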

The Importance of 3D Face Geometry and Multi-View Augmentation

The use of 3D face geometry priors contributes significantly to a realistic representation of facial structure. Because the model accounts for the three-dimensional shape of the face, artifacts and distortions that can arise when generating video from 2D images alone are reduced. The multi-view augmentation broadens the model's understanding of different facial expressions and viewpoints: by seeing the face from several perspectives, the model learns to recognize and maintain identity regardless of head pose or expression.
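As a rough illustration, one training-time multi-view augmentation step might look like the following, assuming a dataset that stores several reference images per identity captured from different viewpoints. The function name and sampling strategy are hypothetical.

```python
import random
import torch

def sample_multiview_references(identity_views, k=3):
    """identity_views: list of (C, H, W) tensors showing the SAME person
    from different camera poses. Returning k randomly chosen views means
    the reference seen during training rarely matches the target frame's
    pose exactly, which discourages copy-paste behaviour."""
    views = random.sample(identity_views, k=min(k, len(identity_views)))
    return torch.stack(views)  # (k, C, H, W)
```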

Layer-Specific Adaptation for Optimized Control

The learnable, layer-specific adaptation mechanism is another important component of FantasyID. It allows for finer control over the generation process by feeding the control information selectively into the different layers of the diffusion transformer. This leads to a better balance between preserving identity and the natural representation of facial movements and expressions.
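Building on the GatedFaceInjection sketch above, the following usage example shows how one such module per DiT layer would let different depths learn different gate values. The layer count, tensor shapes, and the idea of inspecting gates after training are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 12 DiT layers, 2 videos per batch, 256 video
# tokens, 16 face tokens, hidden width 1024.
num_layers, B, T, N, D = 12, 2, 256, 16, 1024
injections = nn.ModuleList([GatedFaceInjection(D, D) for _ in range(num_layers)])

hidden = torch.randn(B, T, D)       # video tokens entering the DiT stack
face_tokens = torch.randn(B, N, D)  # fused 2D+3D condition tokens

for inject in injections:
    # ... the pre-trained DiT block would run here ...
    hidden = inject(hidden, face_tokens)

# After training, the learned gates would reveal which depths take more
# identity signal and which retain freedom for motion.
gates = [float(torch.tanh(m.gate)) for m in injections]
```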

Experimental Results and Outlook

Initial experimental results suggest that FantasyID achieves improved performance compared to existing tuning-free IPT2V methods. The combination of 3D face geometry priors, multi-view augmentation, and layer-specific adaptation enables the generation of videos with convincing facial dynamics and high identity accuracy. Future research could focus on further optimizing the framework and extending it to other application areas.

Bibliography:
Zhang, Y., Wang, Q., Jiang, F., Fan, Y., Xu, M., & Qi, Y. (2025). FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation. arXiv preprint arXiv:2502.13995. https://arxiv.org/abs/2502.13995
https://arxiv.org/html/2502.13995v1
https://huggingface.co/papers
https://github.com/showlab/Awesome-Video-Diffusion
https://twitter.com/gastronomy/status/1892802926571135289
https://www.researchgate.net/scientific-contributions/Di-Qiu-2248315314
https://paperswithcode.com/paper/id-animator-zero-shot-identity-preserving
https://dreamidentity.github.io/
https://www.qeios.com/read/TZIID6
https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_MetaPortrait_Identity-Preserving_Talking_Head_Generation_With_Fast_Personalized_Adaptation_CVPR_2023_paper.pdf