Video-MMMU: A New Benchmark for Evaluating Video Comprehension in AI Models

Learning from Videos: Video-MMMU Evaluates Knowledge Acquisition of AI Models

The rapid development of large multimodal models (LMMs) has revolutionized the way artificial intelligence processes information. LMMs can combine and analyze text, images, audio, and video, giving them the potential to handle more complex tasks and learn in a more human-like way. A key aspect of human learning is the ability to extract and apply knowledge from videos. But how good are LMMs at learning from videos? A new benchmark called Video-MMMU aims to find out.

Humans acquire knowledge in three cognitive stages: perceiving information, comprehending it, and adapting it to solve new problems. Videos are an effective medium for this learning process, yet previous benchmarks have not systematically evaluated how well LMMs acquire knowledge from them. Video-MMMU closes this gap by testing LMMs across three question tracks aligned with these cognitive stages: perception, comprehension, and adaptation.

The benchmark consists of 300 expert-level videos spanning six disciplines, with topics ranging from computer science, physics, chemistry, and mathematics to economics and medicine. For each video, three questions were formulated, one per cognitive stage, yielding 900 human-annotated questions in total.
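To make this structure concrete, here is a minimal sketch of how a single benchmark entry could be modeled in Python. This is purely illustrative: the class names, field names, and example content are hypothetical and do not reflect the official data format of the benchmark.

```python
from dataclasses import dataclass

@dataclass
class TrackQuestion:
    track: str          # "perception", "comprehension", or "adaptation"
    question: str
    options: list[str]  # multiple-choice options
    answer: str         # correct option label, e.g. "B"

@dataclass
class VideoItem:
    video_id: str
    discipline: str
    questions: list[TrackQuestion]  # exactly three, one per cognitive stage

# One invented entry for illustration:
item = VideoItem(
    video_id="physics_0042",
    discipline="Physics",
    questions=[
        TrackQuestion("perception",
                      "Which formula is written on the slide at 02:15?",
                      ["A) E = mc^2", "B) F = ma"], "B"),
        TrackQuestion("comprehension",
                      "Why does the derivation assume a frictionless surface?",
                      ["A) ...", "B) ..."], "A"),
        TrackQuestion("adaptation",
                      "Apply the derived method to a block on an inclined plane.",
                      ["A) ...", "B) ..."], "B"),
    ],
)
assert len(item.questions) == 3
```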

Evaluating LMMs with Video-MMMU reveals a sharp decline in performance as cognitive demands increase. While the models perform relatively well on perception questions, they struggle to comprehend the presented knowledge and to apply it to new problems. This points to a significant gap between human and machine knowledge acquisition.

The Importance of Video-MMMU for AI Research

Video-MMMU provides a standardized method for evaluating the ability of LMMs to learn from videos. This is an important step towards developing AI systems that can acquire and apply knowledge independently. The benchmark's results provide valuable insights into the strengths and weaknesses of current LMMs and can help guide future research efforts.

Beyond the benchmark itself, the study introduces a metric called Δknowledge (delta knowledge). It quantifies a model's performance improvement on exam questions after watching the relevant video, enabling a more precise measurement of knowledge gain. Δknowledge could prove to be a useful tool for developing and evaluating learning algorithms for LMMs.
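In the paper, Δknowledge is reported as a normalized gain: the raw accuracy improvement after watching the video, divided by the remaining headroom (the distance from the pre-video score to a perfect score). Below is a minimal sketch of that computation, assuming accuracies in [0, 1]; the function name and example numbers are illustrative, not taken from the paper.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain: the fraction of the remaining
    headroom (1 - acc_before) that the model closes after
    watching the video. Accuracies are given in [0, 1]."""
    if acc_before >= 1.0:
        return 0.0  # no headroom left to improve
    return (acc_after - acc_before) / (1.0 - acc_before)

# Example (invented numbers): a model scores 40% without the video
# and 55% after watching it, closing a quarter of the remaining gap.
print(f"{delta_knowledge(0.40, 0.55):.1%}")  # -> 25.0%
```

Normalizing by the headroom makes gains comparable across models: a jump from 40% to 55% and a jump from 80% to 85% both close 25% of the respective remaining gap.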

Developing AI systems that can learn from videos like humans is a complex undertaking. Video-MMMU represents an important milestone on this path. By systematically evaluating the knowledge acquisition of LMMs, the benchmark helps to reveal the limitations of current AI systems and drive the development of more powerful models. For companies like Mindverse, which specialize in the development of AI solutions, these findings are invaluable for shaping the next generation of AI applications.
