VistaDPO Improves Large Video Model Performance by Reducing Hallucinations

Optimizing Large Video Models: VistaDPO Reduces Hallucinations and Improves Video Understanding

Large Video Models (LVMs), built on top of Large Language Models (LLMs), are showing promising results in video understanding. However, they often suffer from misalignment with human intent and from video hallucinations, i.e., generating descriptions of events or objects that do not actually appear in the video. A new approach called VistaDPO (Video Hierarchical Spatial-Temporal Direct Preference Optimization) promises to address these issues.

VistaDPO aims to improve preference alignment between video and text at three hierarchical levels (a sketch of the underlying objective follows the list):

* Instance level: aligning the overall video content with the model's responses.
* Temporal level: aligning the video's temporal semantics with descriptions of events.
* Perceptual level: aligning spatial objects with language tokens.
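
For context, VistaDPO builds on Direct Preference Optimization (DPO), which trains a policy model against a frozen reference model on pairs of chosen and rejected responses. The generic DPO objective is shown below; VistaDPO extends this idea across the three levels above, so the paper's exact hierarchical loss differs:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here $x$ is the prompt (including the video), $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_\theta$ and $\pi_{\mathrm{ref}}$ are the policy and reference models, and $\beta$ controls how strongly the policy may deviate from the reference.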

This hierarchical approach enables finer-grained tuning of LVMs and helps keep visual and linguistic information coherent. At the instance level, the model's response must match the video as a whole. The temporal level focuses on the correct ordering of events within the video. Finally, the perceptual level ensures that the objects depicted in the video correspond to their linguistic descriptions.
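
To make the hierarchical idea concrete, here is a minimal PyTorch sketch that combines one DPO-style term per level into a single training loss. The function names, dictionary keys, and level weights are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO term: reward the margin by which the policy prefers the
    # chosen response over the rejected one, relative to a frozen reference.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def hierarchical_dpo_loss(levels, weights=(1.0, 1.0, 1.0)):
    # levels: one dict of summed log-probs per level (instance, temporal,
    # perceptual); weights balance the three objectives (assumed values).
    return sum(
        w * dpo_term(lv["logp_chosen"], lv["logp_rejected"],
                     lv["ref_chosen"], lv["ref_rejected"])
        for w, lv in zip(weights, levels)
    )

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
rand = lambda: torch.randn(4)
level = lambda: {"logp_chosen": rand(), "logp_rejected": rand(),
                 "ref_chosen": rand(), "ref_rejected": rand()}
print(hierarchical_dpo_loss([level(), level(), level()]))
```

In practice, the instance-level log-probs would come from scoring whole responses, while the temporal and perceptual terms would score responses against perturbed clips or masked regions; the sketch only shows how the three terms combine.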

One challenge in developing VistaDPO was the lack of datasets for fine-grained video-language preference optimization. To close this gap, the authors created VistaDPO-7k, a dataset of 7,200 question-answer pairs annotated with chosen and rejected answers, as well as spatial-temporal information such as timestamps, keyframes, and bounding boxes. This gives LVMs rich, detailed supervision for preference training and thereby improves their performance.
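
For illustration, a single VistaDPO-7k entry might look roughly like the following; the field names and values are assumptions chosen for readability, not the released schema:

```python
# Hypothetical structure of one VistaDPO-7k sample (field names assumed):
sample = {
    "video": "videos/0001.mp4",                   # source clip
    "question": "What does the person do after opening the fridge?",
    "chosen": "They take out a bottle of milk.",  # preferred answer
    "rejected": "They close the window.",         # hallucinated answer
    "timestamps": [12.4, 15.8],                   # grounded event span (seconds)
    "keyframes": [310, 372],                      # relevant frame indices
    "bboxes": {                                   # object boxes (x1, y1, x2, y2)
        "person": [64, 40, 220, 360],
        "bottle": [150, 120, 190, 200],
    },
}
```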

Extensive experiments on benchmarks covering video hallucination, video question answering, and video captioning show that VistaDPO significantly improves the performance of existing LVMs, reducing both video-language misalignment and hallucinations. The results suggest that VistaDPO's hierarchical optimization and detailed dataset are an important contribution toward more robust and reliable LVMs.

Research on LVMs is a dynamic field, and VistaDPO represents a promising step toward models that can understand and reason about videos more reliably. The ability to accurately interpret video content opens up a variety of applications in areas such as automatic video analysis, content creation, and human-computer interaction. The availability of code and data for VistaDPO allows other researchers to build on these results and further advance the development of LVMs.

For companies like Mindverse, which specialize in AI-powered content creation and processing, these advancements are particularly relevant. The improved performance of LVMs through approaches like VistaDPO opens up new possibilities for developing innovative solutions in areas such as chatbots, voicebots, AI search engines, and knowledge systems. The ability to accurately analyze and generate videos can significantly increase the quality and efficiency of content creation processes and open up new ways to interact with digital content.

Bibliography

Huang, H., Chen, H., Wu, S., Luo, M., Fu, J., Du, X., Zhang, H., & Fei, H. (2025). VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models. arXiv preprint arXiv:2504.13122.

Qing, Y., Wu, C., Zhou, S., Zhang, Y., & Loy, C. C. (2024). Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19249-19258).

Li, R., Zhang, J., Li, H., & Snoek, C. G. (2024). Temporal Preference Optimization for Long-Form Video Understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 10, pp. 11602-11610).

Shaham, T. Z., Dekel, T., & Michaeli, T. (2024). Scaling Laws for Multilingual Generative Language Models. arXiv preprint arXiv:2404.01258.

Su, J., Zhou, K., Zhang, G., Li, Z., Cao, Y., & Wu, F. (2025). ChatPaper: Augmenting LLMs with Search for Enhanced Dialogue. arXiv preprint arXiv:2501.13919.