Assessing Physical Commonsense in Video LLMs with PhysGame Benchmark

Artificial Intelligence (AI) is rapidly evolving, and its applications in video production and analysis are particularly promising. Video-based large language models (Video-LLMs) can interpret and react to dynamic visual content. A key aspect of how humans understand videos is physical common sense – the intuitive grasp of how objects behave and interact in the real world. Evaluating this ability in Video-LLMs, however, has received little systematic attention.
Gameplay Videos as a Test Environment
Gameplay videos offer a unique opportunity to test the understanding of physical common sense. They often contain "glitches" – errors in the game programming that lead to physically impossible scenarios. These deviations from the laws of physics provide an ideal basis for assessing whether a Video-LLM recognizes these discrepancies.
PhysGame: A New Benchmark
To evaluate this ability, PhysGame was developed – a benchmark specifically designed for detecting violations of physical common sense in gameplay videos. PhysGame comprises 880 videos with glitches covering four fundamental physical domains: mechanics, kinematics, optics, and material properties. A total of 12 different physical principles are tested. The videos in PhysGame were carefully selected and categorized to enable a comprehensive and differentiated evaluation.
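A benchmark structured this way – videos grouped by physical domain, each probing a specific principle – lends itself to per-domain accuracy reporting. The following is a minimal sketch of such an evaluation loop; the record layout and field names are illustrative assumptions, not PhysGame's official schema.

```python
from dataclasses import dataclass

@dataclass
class PhysGameItem:
    # Hypothetical record layout; field names are illustrative,
    # not the official PhysGame annotation schema.
    video_path: str
    question: str        # question about the glitch in the clip
    options: list[str]   # candidate answers
    answer: str          # ground-truth option letter, e.g. "B"
    domain: str          # mechanics, kinematics, optics, or material properties

def accuracy_by_domain(items, predict):
    """Score a model's multiple-choice predictions per physical domain.

    `predict` is any callable mapping an item to an option letter,
    e.g. a wrapper around a Video-LLM inference call.
    """
    correct, total = {}, {}
    for item in items:
        total[item.domain] = total.get(item.domain, 0) + 1
        if predict(item) == item.answer:
            correct[item.domain] = correct.get(item.domain, 0) + 1
    return {d: correct.get(d, 0) / total[d] for d in total}
```

Reporting accuracy per domain rather than a single aggregate score makes it visible whether a model's physical common sense is uniformly weak or fails only in, say, optics.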
Challenges for Open-Source Video-LLMs
Initial tests with PhysGame show that open-source Video-LLMs perform significantly worse than proprietary models. This highlights the need to improve the physical commonsense capabilities of open-source models.
PhysInstruct and PhysDPO: Datasets for Model Optimization
To close this gap, two new datasets were developed: PhysInstruct and PhysDPO. PhysInstruct contains 140,057 question-answer pairs for instruction tuning of Video-LLMs. PhysDPO contains 34,358 training pairs for preference optimization. Its "dis-preferred" answers were deliberately generated under degraded conditions: manipulated metadata (misleading titles), reduced frame rates, and lower spatial resolutions. Training on these contrastive pairs teaches the model to favor answers grounded in the full, unaltered video and to recognize and ignore misleading information.
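The pair-construction idea above can be sketched as follows. This is a hedged illustration, not the paper's pipeline: `generate` stands in for any Video-LLM inference call, and the parameter names, the video dictionary, and the degradation factors are all assumptions made for the example.

```python
def make_dpo_pair(video, question, generate):
    """Build one preference pair for DPO-style training.

    `generate(video, question, title, fps, resolution)` is a placeholder
    for a Video-LLM inference call; its signature is illustrative.
    """
    # Preferred response: the model sees the video under faithful conditions.
    chosen = generate(video, question, title=video["title"],
                      fps=video["fps"], resolution=video["resolution"])
    # Dis-preferred response: the same model, but with a misleading title
    # (manipulated metadata), fewer frames (reduced frame rate), and a
    # lower spatial resolution.
    rejected = generate(video, question,
                        title="ordinary, glitch-free gameplay",
                        fps=max(1, video["fps"] // 4),
                        resolution=video["resolution"] // 2)
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```

The resulting prompt/chosen/rejected triples are the standard input format for preference-optimization trainers, so a dataset built this way plugs directly into a DPO training loop.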
PhysVLM: An Improved Video-LLM
Based on these datasets, PhysVLM was developed, a Video-LLM specifically trained on physical knowledge. Tests with PhysGame and other benchmarks show that PhysVLM achieves state-of-the-art results compared to other models. This underscores the potential of targeted training and specialized datasets for improving the capabilities of Video-LLMs.
Significance for AI-Powered Content Creation
For companies like Mindverse, which offer AI-powered content solutions, these developments are of great importance. A better understanding of physical common sense enables the development of more powerful video analysis tools, more realistic animations, and more interactive virtual environments. The research findings surrounding PhysGame, PhysInstruct, PhysDPO, and PhysVLM contribute to expanding the boundaries of what is possible in AI-powered content creation.
Cao, M., Tang, H., Zhao, H., Guo, H., Liu, J., Zhang, G., Liu, R., Sun, Q., Reid, I., & Liang, X. (2024). PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos. arXiv preprint arXiv:2412.01800.