
Video Understanding in Large Multimodal Models: An Analysis of Apollo

The integration of video perception into large multimodal models (LMMs) is advancing rapidly. However, the mechanisms that drive video understanding in these models are still poorly understood, and many design decisions in this area are therefore made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, further hinders the development of video LMMs. A new study, "Apollo: An Exploration of Video Understanding in Large Multimodal Models," now sheds light on the key factors for effective video understanding in LMMs.

Scaling Consistency: Efficient Learning with Smaller Models

The study identifies "scaling consistency" as a way to mitigate the high computational demands of video LMM research: design and training decisions made with smaller models and datasets (up to a critical size) transfer effectively to larger models. This allows for more efficient research and development, as insights from smaller, less resource-intensive experiments can be scaled to larger models.
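To make this concrete, the following is a minimal sketch of how scaling consistency can be exploited in practice: evaluate competing design variants at a small scale, then check that their scores correlate with, and rank the same as, results at the target scale. The variant names and all scores below are made-up placeholders, not numbers from the paper.

```python
# Hypothetical sketch of exploiting "scaling consistency": run each design
# variant at a small scale and verify that scores transfer to the target scale.
import numpy as np

variants = ["fps-sampling", "uniform-sampling", "resampler", "avg-pool"]
score_small = np.array([52.1, 48.3, 51.0, 49.5])  # placeholder scores at small scale
score_large = np.array([58.4, 53.9, 57.2, 55.0])  # same variants at the target scale

# Correlation between small- and large-scale scores: a value near 1 means
# small-scale experiments are a reliable proxy for large-scale behavior.
r = np.corrcoef(score_small, score_large)[0, 1]
print(f"correlation across scales: r = {r:.3f}, R^2 = {r**2:.3f}")

# If the ranking is preserved, the variant that wins at small scale
# can be promoted to the expensive large-scale training run.
best = variants[int(np.argmax(score_small))]
print(f"design chosen at small scale: {best}")
```

If the cross-scale correlation breaks down for a given axis of the design space, that axis has to be tuned at full scale; the value of the finding is that, up to the critical size, most axes do transfer.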

Optimizing Video-Specific Aspects

Building on the concept of scaling consistency, the study investigated various video-specific aspects of video LMMs, including:

- Video sampling
- Architectures
- Data composition
- Training schedules

For example, the study shows that FPS sampling during training is significantly superior to uniform frame sampling, and it identifies which vision encoders are best suited for video representation. FPS-based frame selection keeps the temporal density of sampled frames constant regardless of video length, allowing the model to focus on the most relevant information in the video while using compute more effectively and improving accuracy.
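As an illustration, here is a minimal sketch contrasting the two sampling strategies. The function names, the 2 fps target, and the 64-frame cap are illustrative assumptions, not values taken from the paper.

```python
# Sketch: fps-based frame selection vs. uniform frame selection.
# `native_fps` and `duration_s` would normally come from the video container.
import numpy as np

def sample_fps(native_fps: float, duration_s: float, target_fps: float = 2.0,
               max_frames: int = 64) -> np.ndarray:
    """Pick frame indices at a fixed temporal rate (here 2 frames per second),
    so temporal density is constant regardless of clip length."""
    step = native_fps / target_fps                    # native frames per sampled frame
    idx = np.arange(0, native_fps * duration_s, step).astype(int)
    return idx[:max_frames]                           # respect the frame/token budget

def sample_uniform(native_fps: float, duration_s: float,
                   num_frames: int = 64) -> np.ndarray:
    """Pick a fixed number of evenly spaced frames: a 10 s clip and a 2 h video
    get the same frame count, so the effective fps varies wildly with duration."""
    total = int(native_fps * duration_s)
    return np.linspace(0, total - 1, num_frames).astype(int)

# A 30 s clip at 30 fps: fps sampling holds a steady 2 fps, while uniform
# sampling of 64 frames would drop to roughly 0.009 fps on a 2-hour video.
print(sample_fps(30, 30)[:8])
print(sample_uniform(30, 30)[:8])
```

The practical consequence is that, with uniform sampling, the model sees motion at a different temporal resolution for every clip length, whereas fps sampling presents motion consistently across short and long videos.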

Apollo: A Family of State-of-the-Art LMMs

The insights from the study led to the development of Apollo, a family of state-of-the-art LMMs that achieve superior performance across various model sizes. The Apollo models can efficiently process hour-long videos. Apollo-3B outperforms most existing 7B models with a score of 55.1 on LongVideoBench, and Apollo-7B sets a new state of the art among 7B LMMs with 70.9 points on MLVU and 63.3 points on Video-MME.

Conclusion: A Step Towards Better Understanding of Video LMMs

The Apollo study provides valuable insights into the workings of video LMMs. By identifying scaling consistency and optimizing video-specific aspects, it enables more efficient development and improved performance. The Apollo models demonstrate the potential of this approach and set new benchmarks in the field of video understanding. Future research can build on these findings to further enhance the capabilities of video LMMs and unlock new application possibilities.

Bibliography: Zohar, O., et al. "Apollo: An Exploration of Video Understanding in Large Multimodal Models." arXiv preprint arXiv:2412.10360 (2024). https://arxiv.org/abs/2412.10360