CINEMA: Generating Coherent Multi-Subject Videos with MLLMs

The Future of Video Generation: Coherent Multi-Subject Videos with CINEMA
Video generation has made remarkable progress through deep learning, particularly diffusion models. High-quality videos can now be generated from text descriptions or single images. However, the personalized generation of videos with multiple subjects that interact consistently across time and space remains a challenge.
Previous approaches mainly map images of the subjects to keywords in the text input. This risks ambiguity and limits the ability to model relationships between subjects effectively. Consider generating a video with the subjects "cat" and "mouse": a simple text prompt can lead to undesirable results, because the relationship between cat and mouse is not clearly defined. Is the cat playing with the mouse? Is it chasing the mouse? Or are the two cuddling? The model's interpretation of the text can produce different, and possibly unwanted, scenarios.
A new approach called CINEMA (Coherent Multi-Subject Video Generation via MLLM-Based Guidance) addresses this problem. CINEMA leverages the capabilities of multimodal large language models (MLLMs) to generate coherent multi-subject videos. Unlike previous methods, CINEMA does not require explicit mapping of subject images to text. Instead, the MLLM interprets the relationships between the subjects from the provided images and generates a coherent video accordingly.
How does CINEMA work?
The core of CINEMA lies in its use of MLLMs. These models are trained to understand and relate different modalities, such as text and images. Given reference images of the desired subjects, the MLLM can infer the relationships between them and develop a narrative understanding of the scene. This understanding then guides the video generation, ensuring that the interactions between the subjects are coherent and meaningful.
The flexibility of CINEMA allows for the generation of videos with a variable number of subjects. This opens up new possibilities for personalized content, as users can individually adjust the number and type of subjects in their videos.
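The flow described above — an MLLM fusing a text prompt with an arbitrary number of reference images into a scene understanding that then conditions the video model — can be sketched as follows. This is a minimal toy illustration, not the actual CINEMA implementation; all class and function names (`SubjectImage`, `mllm_interpret`, `generate_video`) are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubjectImage:
    """Hypothetical container for one subject's reference image."""
    name: str
    pixels: list  # placeholder for image data

def mllm_interpret(prompt: str, subjects: List[SubjectImage]) -> str:
    """Stand-in for the MLLM: fuses the text prompt with an arbitrary
    number of reference images into a single scene description,
    instead of mapping each image to a keyword in the prompt."""
    names = ", ".join(s.name for s in subjects)
    return f"{prompt} | subjects: {names}"

def generate_video(scene_description: str, num_frames: int = 8) -> List[str]:
    """Stand-in for the diffusion backbone: every frame is conditioned
    on the MLLM's scene understanding, keeping subjects consistent."""
    return [f"frame {i}: {scene_description}" for i in range(num_frames)]

# A variable number of subjects -- two here, but any count works.
subjects = [SubjectImage("cat", []), SubjectImage("mouse", [])]
scene = mllm_interpret("a cat playing with a mouse", subjects)
video = generate_video(scene, num_frames=4)
```

The key point of the sketch is the interface: the video generator never sees raw subject keywords, only the MLLM's fused scene description, which is where the relational ambiguity (playing vs. chasing) is resolved.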
Advantages of CINEMA
CINEMA offers several advantages over conventional methods:
- Elimination of the need for explicit mappings between subject images and text
- Reduction of ambiguities in the interpretation of the scene
- Reduced annotation effort
- Scalability through training on large and diverse datasets
- Flexibility in the number of subjects
- Improved subject consistency and video coherence

Outlook
CINEMA represents a promising advance in the field of video generation. The use of MLLMs enables the creation of coherent and personalized multi-subject videos that can be applied in various areas, including storytelling, interactive media, and personalized video generation. Future research could focus on improving the fine-tuning of the MLLMs and expanding the possibilities for controlling the generated videos. The development of CINEMA and similar approaches could fundamentally change the way we create and consume videos.