Divot: A Diffusion-Based Video Tokenizer for AI Video Understanding and Generation

The Fusion of Diffusion and Video: Divot as an Innovative Video Tokenizer for Understanding and Generation

Artificial Intelligence (AI) is evolving rapidly, and multimodal large language models (MLLMs), which combine text comprehension with the ability to generate images, are advancing particularly fast. A new branch of research is now dedicated to extending these capabilities to videos. The challenge lies in developing a versatile video tokenizer that captures both spatial and temporal information from videos and makes it usable by LLMs, while also allowing those representations to be decoded back into realistic video clips for video generation.

A promising approach in this area is Divot, a diffusion-based video tokenizer. Divot uses the diffusion process for self-supervised learning of video representations. The core idea: if a video diffusion model can effectively denoise noisy video clips using the tokenizer's features as a condition, then those features must capture robust spatial and temporal information. In addition, the video diffusion model acts as a decoder, reconstructing videos from their representations.
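To make this training signal concrete, here is a minimal, self-contained PyTorch sketch of such a denoising objective. The toy modules, dimensions, and noise schedule are illustrative assumptions, not the released Divot implementation; the point is only that the noise-prediction loss depends on the tokenizer's features as the sole condition.

```python
# Minimal sketch of a diffusion-based tokenizer objective (toy modules and a
# toy noise schedule; "ToyTokenizer" and "ToyDenoiser" are assumptions, not
# the released Divot code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTokenizer(nn.Module):
    """Maps a video clip to a small set of feature tokens."""
    def __init__(self, in_dim=3 * 32 * 32, dim=256, num_tokens=16):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.query = nn.Parameter(torch.randn(num_tokens, dim))

    def forward(self, video):                        # video: (B, T, C, H, W)
        frames = self.proj(video.flatten(2))         # (B, T, dim) per-frame features
        attn = torch.softmax(self.query @ frames.transpose(1, 2), dim=-1)
        return attn @ frames                         # (B, num_tokens, dim)

class ToyDenoiser(nn.Module):
    """Predicts the noise added to video latents, conditioned on the tokens."""
    def __init__(self, latent_ch=4, cond_dim=256):
        super().__init__()
        self.cond = nn.Linear(cond_dim, latent_ch)
        self.net = nn.Conv3d(latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latents, cond_tokens):   # noisy_latents: (B, C, T, H, W)
        c = self.cond(cond_tokens.mean(dim=1))[:, :, None, None, None]
        return self.net(noisy_latents + c)

def denoising_loss(tokenizer, denoiser, video, latents):
    """If the denoiser can recover the noise using only the tokenizer's features
    as condition, those features must capture the clip's spatio-temporal content."""
    cond = tokenizer(video)
    noise = torch.randn_like(latents)
    t = torch.rand(latents.size(0), 1, 1, 1, 1, device=latents.device)
    noisy = (1 - t).sqrt() * latents + t.sqrt() * noise   # toy noise schedule
    return F.mse_loss(denoiser(noisy, cond), noise)

# Example shapes: 8 frames of 32x32 RGB and 4-channel latents at 8x8 resolution.
video = torch.randn(2, 8, 3, 32, 32)
latents = torch.randn(2, 4, 8, 8, 8)
loss = denoising_loss(ToyTokenizer(), ToyDenoiser(), video, latents)
```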

How Divot Works

Divot consists of a pre-trained Vision Transformer (ViT) encoder, a spatio-temporal transformer, and a Perceiver Resampler. These components extract video representations from frames sampled at a low frame rate. The representations serve as the condition for a pre-trained video diffusion model, DynamiCrafter, which predicts the noise added to the VAE latents of the video frames. After training, the video diffusion model can generate realistic video clips from noise, using the representations provided by Divot as the condition.
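The sketch below illustrates how these three components could fit together: a ViT-style patch stem encodes each frame, a spatio-temporal transformer mixes information across frames, and a Perceiver Resampler compresses everything into a fixed number of video tokens. Layer sizes, depths, and the token count are assumptions for illustration rather than the paper's configuration, and the simple patch stem stands in for the pre-trained ViT encoder.

```python
# Illustrative composition of the tokenizer pipeline (assumed dimensions; the
# Conv2d patch stem is a stand-in for the pre-trained ViT frame encoder).
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Cross-attends a fixed set of learned latents onto the frame tokens,
    yielding a constant number of video tokens per clip."""
    def __init__(self, dim=256, num_latents=32, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                        # tokens: (B, N, dim)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(queries, tokens, tokens)
        return out                                    # (B, num_latents, dim)

class DivotStyleTokenizer(nn.Module):
    """Frame encoder -> spatio-temporal transformer -> Perceiver Resampler."""
    def __init__(self, dim=256):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT-style stem
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.spatio_temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.resampler = PerceiverResampler(dim)

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.patch_embed(frames.flatten(0, 1))    # (B*T, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)              # (B*T, patches, dim)
        x = x.reshape(b, t * x.size(1), -1)           # all patches of all frames
        x = self.spatio_temporal(x)                   # mix information across space and time
        return self.resampler(x)                      # fixed-size video representation

# Example: a short clip of 5 frames sampled at low fps, 64x64 resolution.
clip = torch.randn(2, 5, 3, 64, 64)
video_tokens = DivotStyleTokenizer()(clip)            # condition for the video decoder
```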

Combining Divot with a pre-trained LLM such as Mistral-7B yields Divot-LLM. This model is trained with a next-word prediction objective on video-caption data, using Divot's spatio-temporal representations as input for video understanding. For video generation, the distribution of the continuous video features is modeled with a Gaussian Mixture Model (GMM), and the LLM is trained to predict the GMM's parameters. During inference, video features are sampled from the predicted GMM distribution and used as the condition for the video decoder to generate video clips.
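The following sketch shows the GMM idea: a small head projects the LLM's hidden state to mixture weights, means, and variances; training minimizes the negative log-likelihood of the target video feature, and inference samples a feature from the predicted mixture to condition the video decoder. The head layout and dimensions are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of a GMM head over continuous video features (assumed design).
import torch
import torch.nn as nn
import torch.distributions as D

class GMMHead(nn.Module):
    def __init__(self, hidden_dim=4096, feat_dim=256, num_components=16):
        super().__init__()
        self.k, self.d = num_components, feat_dim
        # One projection per parameter group: mixture logits, means, log-variances.
        self.to_logits = nn.Linear(hidden_dim, num_components)
        self.to_means = nn.Linear(hidden_dim, num_components * feat_dim)
        self.to_logvar = nn.Linear(hidden_dim, num_components * feat_dim)

    def distribution(self, h):                     # h: (B, hidden_dim) LLM hidden state
        mix = D.Categorical(logits=self.to_logits(h))
        means = self.to_means(h).view(-1, self.k, self.d)
        scales = self.to_logvar(h).view(-1, self.k, self.d).mul(0.5).exp()
        comp = D.Independent(D.Normal(means, scales), 1)   # diagonal Gaussians
        return D.MixtureSameFamily(mix, comp)

    def loss(self, h, target_feature):             # training: maximize likelihood
        return -self.distribution(h).log_prob(target_feature).mean()

    def sample(self, h):                           # inference: draw a video feature
        return self.distribution(h).sample()

# Example: LLM hidden states and target tokenizer features.
head = GMMHead()
h = torch.randn(4, 4096)                           # e.g. Mistral-7B hidden size
target = torch.randn(4, 256)                       # one continuous video token
nll = head.loss(h, target)                         # next-feature prediction loss
feature = head.sample(h)                           # condition for the video decoder
```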

Applications and Potential

Divot-LLM shows promising results on various video understanding benchmarks and in zero-shot video generation. Thanks to the versatility of the video tokenizer, Divot-LLM also enables video storytelling, generating interleaved narratives and corresponding videos that remain temporally coherent. This capability is achieved by fine-tuning on a specialized animation dataset.

Research on Divot and similar technologies is of great importance for the further development of AI systems. The ability to understand and create dynamic visual content opens up new possibilities in areas such as automated video analysis, the creation of personalized video content, and the development of interactive virtual environments. The combination of diffusion models and LLMs represents an important step towards a more comprehensive AI that is capable of interpreting and shaping the complex world of videos.

The Future of Video AI

Divot and similar developments are still in their early stages. Research is focused on improving the quality of generated videos, expanding the possibilities for controlling the generation process, and developing more efficient training methods. The scalability of the models is also an important aspect to enable the generation of longer and more complex videos. The combination of diffusion models with LLMs opens a new chapter in AI research and promises exciting advancements in the world of video AI.

Mindverse, as a German company for AI-powered content creation, is watching these developments with great interest. Integrating technologies like Divot into the Mindverse platform could enable users to create and edit videos in innovative ways in the future. From the automated generation of video content to interactive video editing, the possibilities are diverse and open up new perspectives for content creation.

Bibliography:
- https://arxiv.org/abs/2412.04432
- https://arxiv.org/html/2412.04432v1
- https://github.com/TencentARC/Divot
- https://paperswithcode.com/latest?page=3
- https://www.researchgate.net/scientific-contributions/Karttikeya-Mangalam-2128114566
- https://www.aimodels.fyi/authors/arxiv/Yuying%20Ge
- https://github.com/showlab/Awesome-Video-Diffusion
- https://www.researchgate.net/figure/Datasets-used-for-training-the-tokenizer-and-Divot-LLM_tbl1_386502767
- https://www.catalyzex.com/author/Ying%20Shan
- https://arxiv-sanity-lite.com/?rank=pid&pid=2412.04446