Q-Eval-100K Dataset and Q-Eval-Score Advance Text-to-Vision Evaluation

Evaluating Text-to-Image and Text-to-Video Content with Q-Eval-100K

The rapid development of AI models that generate images and videos from text descriptions (Text-to-Vision) calls for robust evaluation methods. Two aspects are central: the visual quality of the generated content and its alignment with the text prompt. Objective evaluation models already exist, but their performance depends strongly on the quality and quantity of the human ratings they are trained on. A new dataset, Q-Eval-100K, addresses this challenge and promises to significantly improve the evaluation of Text-to-Vision content.

Q-Eval-100K: A Comprehensive Dataset for Evaluating Visual Quality and Alignment

Q-Eval-100K is currently the largest dataset of its kind, containing 960,000 human annotations in the form of Mean Opinion Scores (MOS) for visual quality and alignment. The dataset covers both text-to-image and text-to-video models and comprises 100,000 instances: 60,000 images and 40,000 videos. Because the annotations focus explicitly on these two aspects, the dataset provides a solid basis for training and evaluating assessment models.
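To illustrate what MOS annotation means in practice, the sketch below averages raw human ratings into per-instance scores. The 1-5 rating scale, the data layout, and the `mos` helper are illustrative assumptions, not the dataset's actual format.

```python
# Minimal sketch of aggregating raw human ratings into Mean Opinion
# Scores (MOS). The rating scale and dictionary layout are assumptions
# for illustration; Q-Eval-100K's actual format may differ.
from statistics import mean

def mos(ratings):
    """Average a list of human ratings into a single MOS."""
    return round(mean(ratings), 2)

# Hypothetical annotations for one generated image: each instance
# carries separate ratings for visual quality and text alignment.
instance = {
    "prompt": "a red bicycle leaning against a brick wall",
    "quality_ratings": [4, 5, 4, 3, 4],
    "alignment_ratings": [5, 5, 4, 5, 4],
}

quality_mos = mos(instance["quality_ratings"])      # 4.0
alignment_mos = mos(instance["alignment_ratings"])  # 4.6
```

Keeping quality and alignment as separate scores matters: an image can be photorealistic yet ignore the prompt, or match the prompt while looking poor.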

Q-Eval-Score: A Unified Evaluation Model

Based on the Q-Eval-100K dataset, the authors developed Q-Eval-Score, a unified model for evaluating both visual quality and alignment. Particularly noteworthy is its ability to handle long text prompts and assess their alignment with the generated content, a challenge that has often tripped up previous models; taking the full context of the text input into account plays a crucial role here.
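One common way to make long prompts tractable, sketched below, is to split the prompt into shorter pieces, score each against the generated image, and combine the results. This is an illustration of the general idea only, not the paper's algorithm; `score_alignment` is a hypothetical stand-in for a real model call.

```python
# Hedged sketch: handling a long prompt by scoring shorter sub-prompts
# and averaging. All function names here are illustrative assumptions.

def split_prompt(prompt, max_words=8):
    """Break a long prompt into chunks of at most max_words words."""
    words = prompt.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def score_alignment(image, sub_prompt):
    # Placeholder: a real system would query a vision-language
    # model here and return a score for this sub-prompt.
    return 0.8

def unified_alignment_score(image, prompt):
    """Average the per-chunk alignment scores into one number."""
    parts = split_prompt(prompt)
    return sum(score_alignment(image, p) for p in parts) / len(parts)
```

The design choice being illustrated: scoring a 60-word prompt as one unit forces the evaluator to attend to everything at once, whereas chunk-wise scoring surfaces which parts of the prompt the image actually satisfies.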

Experimental Results and Outlook

Experiments show that Q-Eval-Score outperforms existing models at evaluating visual quality and alignment. The model also generalizes well, delivering consistent results even on content not included in Q-Eval-100K. These results underscore the value of the Q-Eval-100K dataset and the Q-Eval-Score model built on it for the further development of Text-to-Vision technologies.

The publication of Q-Eval-100K and Q-Eval-Score is an important step for research and development in the field of AI-generated visual content. By providing a comprehensive and high-quality dataset as well as a powerful evaluation model, the development and optimization of Text-to-Vision systems is further advanced. This opens up new possibilities for creative applications and innovative solutions in various areas, from automated content creation to virtual worlds and interactive experiences.

Bibliography:

Zhang, Zicheng, et al. "Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content." arXiv preprint arXiv:2503.02357 (2025).

Wang, et al. "Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting." CVPR 2023.

Jia, et al. "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision." ICML 2021.