Large-Scale Pretraining Improves Grounded Video Caption Generation

The automatic generation of descriptions for videos is an active research area in Artificial Intelligence. A particularly promising direction is "grounded" video description generation, where the objects mentioned in the description are linked to the corresponding regions of the video, typically via bounding boxes. This enables a deeper understanding of the video content and opens up new applications, for example in video search or human-computer interaction.
Researchers recently presented a new approach to grounded video description generation based on large-scale pre-training. At its core is the construction of a comprehensive dataset with automatically generated annotations that describe the objects in a video and track them over time. These annotations serve as the basis for pre-training a neural network, which is subsequently fine-tuned on smaller, manually annotated datasets.
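The following Python sketch illustrates this two-stage recipe in the abstract: a large pre-training pass on noisy, automatically annotated data, followed by a short fine-tuning pass on a small, clean dataset. The model, the dummy data, and all hyperparameters are toy placeholders chosen for illustration, not the configuration used in the paper.

```python
# Minimal sketch of the two-stage recipe described above: large-scale
# pre-training on noisy, automatically annotated data, then fine-tuning on a
# small, manually annotated set. The model, the dummy data, and all
# hyperparameters are placeholders for illustration only.
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_dummy_dataset(num_samples: int, feat_dim: int = 512, vocab_size: int = 1000):
    """Stand-in for (video feature, caption token) pairs."""
    feats = torch.randn(num_samples, feat_dim)
    targets = torch.randint(0, vocab_size, (num_samples,))
    return TensorDataset(feats, targets)


class ToyCaptioner(torch.nn.Module):
    """Tiny stand-in for a grounded video captioning model."""

    def __init__(self, feat_dim: int = 512, vocab_size: int = 1000):
        super().__init__()
        self.head = torch.nn.Linear(feat_dim, vocab_size)

    def forward(self, feats):
        return self.head(feats)  # (batch, vocab_size) logits


def run_stage(model, dataset, epochs: int, lr: float):
    """One training stage: used for both pre-training and fine-tuning."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, targets in loader:
            loss = loss_fn(model(feats), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


model = ToyCaptioner()
# Stage 1: pre-train on the large, automatically annotated corpus (noisy labels).
run_stage(model, make_dummy_dataset(10_000), epochs=1, lr=1e-4)
# Stage 2: fine-tune on the small, manually annotated corpus with a lower learning rate.
run_stage(model, make_dummy_dataset(500), epochs=3, lr=1e-5)
```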
Automatic Annotation and the HowToGround1M Dataset
Automatically annotating videos at this scale is a major challenge. The new approach starts from existing, single-frame-based annotations and aggregates them into temporally dense, consistent descriptions of object trajectories. Applied to the HowTo100M dataset, this pipeline yields a new large-scale dataset called HowToGround1M, which contains over one million videos with automatically generated grounded descriptions.
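As a rough intuition for the aggregation step, the sketch below greedily links per-frame bounding boxes for the same object phrase into a trajectory whenever consecutive boxes overlap sufficiently (measured by IoU). This is a simplified illustration of turning frame-level annotations into temporally consistent tracks, not the paper's actual aggregation method; all function names and thresholds are assumptions.

```python
# Simplified illustration of turning frame-level annotations into temporally
# consistent object trajectories: boxes for the same phrase are greedily
# chained across frames whenever consecutive boxes overlap (IoU). This is not
# the paper's actual aggregation method; names and thresholds are assumptions.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def link_frame_boxes(frame_annotations, iou_threshold=0.5):
    """Chain per-frame boxes for each phrase into a trajectory.

    frame_annotations: list with one dict per frame, mapping phrase -> box.
    Returns a dict mapping phrase -> list of (frame_index, box).
    """
    tracks = {}
    for t, annotations in enumerate(frame_annotations):
        for phrase, box in annotations.items():
            track = tracks.setdefault(phrase, [])
            # Extend the trajectory only if the new box overlaps the last one,
            # which enforces temporal consistency.
            if not track or iou(track[-1][1], box) >= iou_threshold:
                track.append((t, box))
    return tracks


# Example: a "spatula" detected in three consecutive frames; the implausible
# jump in the last frame is rejected and does not join the trajectory.
frames = [
    {"spatula": (10, 20, 50, 80)},
    {"spatula": (12, 22, 52, 82)},
    {"spatula": (300, 20, 340, 80)},
]
print(link_frame_boxes(frames))
```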
The GROVE Model and the iGround Dataset
For grounded video description generation, a new model called GROVE (Grounded Video Caption Generation) was developed. GROVE is first pre-trained on the HowToGround1M dataset. To evaluate and further improve the model on high-quality data, an additional dataset called iGround was created, consisting of 3,500 videos with manually annotated grounded descriptions. Pre-training on HowToGround1M followed by fine-tuning on iGround enables GROVE to generate precise, detailed video descriptions while localizing the mentioned objects in both space and time.
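To make the task concrete, the following sketch shows the kind of output such a grounded captioning model produces: a caption plus, for each mentioned phrase, one bounding box per frame, with missing boxes indicating frames in which the object is not visible. The data structure and field names are illustrative assumptions, not GROVE's actual interface.

```python
# Illustrative data structure for the output of a grounded video captioning
# model: a caption plus, for each mentioned phrase, one bounding box per frame
# (None where the object is not visible). Field names and layout are assumed
# for clarity; this is not GROVE's actual interface.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class GroundedCaption:
    caption: str
    boxes_per_frame: Dict[str, List[Optional[BBox]]]


example = GroundedCaption(
    caption="A person stirs batter in a bowl with a spatula.",
    boxes_per_frame={
        "person": [(0, 0, 120, 240), (2, 0, 122, 240), (4, 0, 125, 240)],
        "bowl": [(60, 150, 140, 220), (61, 151, 141, 221), None],
        "spatula": [None, (80, 100, 110, 180), (82, 102, 112, 182)],
    },
)

print(example.caption)
for phrase, boxes in example.boxes_per_frame.items():
    visible = sum(box is not None for box in boxes)
    print(f"{phrase}: visible in {visible} of {len(boxes)} frames")
```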
Results and Outlook
The results show that the approach surpasses the previous state of the art on the iGround, VidSTG, and ActivityNet-Entities datasets. The combination of large-scale pre-training on automatically annotated data with fine-tuning on smaller, manually annotated datasets thus proves to be an effective strategy for grounded video description generation. This technology has the potential to fundamentally change how we interact with videos and to open up new possibilities across application areas. Future research could focus on improving the automatic annotation, developing even more powerful models, and extending the approach to further use cases.
For Mindverse, a German company specializing in AI-powered content creation, these advancements in video processing are particularly relevant. Mindverse offers an all-in-one platform for AI texts, images, research, and more. The development of customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems is also part of Mindverse's portfolio. Grounded video description generation could be integrated into Mindverse's platform in the future and offer users new possibilities for automated video analysis and editing.
Bibliography:
Kazakos, E., Schmid, C., & Sivic, J. (2025). Large-scale Pre-training for Grounded Video Caption Generation. *arXiv preprint arXiv:2503.10781*.
Grounded Video Caption Generation. (n.d.). *Papers with Code*. https://paperswithcode.com/paper/grounded-video-caption-generation
Yang, Z., Chen, Y.-C., Li, Y., & Sun, C. (2023). Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 18134–18144.
Zhang, M., Li, X., Yang, J., & Nevatia, R. (2023). Training-free Video Temporal Grounding using Large-scale Pre-trained Models. *arXiv preprint arXiv:2407.06304*.
Huang, Z., Hu, H., & Zhou, J. (2023). Grounded Video Caption Generation. *arXiv preprint arXiv:2301.13080*.
Wu, Y., Chen, S., Li, J., & Wang, Y. (2023). Entity-aware Video Description Generation. *Proceedings of the ACM International Conference on Multimedia Retrieval*, 266–270.