VisionReward: A New Approach to Aligning Image and Video Generation with Human Preferences

Visual generation models, particularly for text-to-image and text-to-video, have made impressive progress in recent years. Models such as Stable Diffusion, DALL-E, and Midjourney generate high-quality images from text descriptions, while models such as CogVideo and Make-A-Video already produce short videos. A central concern in the further development of these technologies is aligning the generated content with human preferences. This is where the new VisionReward model comes in.

Challenges in Aligning Generative Models

Despite the progress, challenges remain in aligning text-to-image and text-to-video models with human preferences:

Current evaluation models that aim to simulate human preferences are often biased and lack transparency. Human preferences arise from a complex interplay of factors, which a single, opaque score struggles to capture and which can lead to inconsistent evaluations.

Video evaluation is particularly difficult: dynamic aspects such as motion realism and fluidity are hard to assess.

Existing optimization methods tend to over-optimize some factors while neglecting others, leading to suboptimal results.

VisionReward: A Multi-Dimensional Evaluation Model

VisionReward pursues a new approach to improve the alignment of generated images and videos with human preferences. The model is based on a finely graded, multi-dimensional evaluation system. Human preferences are broken down into different dimensions, each represented by a series of assessment questions. The answers to these questions are linearly weighted and combined into an interpretable and precise value.
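To make the aggregation idea concrete, the following minimal sketch shows how binary answers to assessment questions could be combined with learned weights into a single interpretable score. The dimension names, questions, and weight values are purely illustrative assumptions, not the checklist or weights used in the paper.

# Minimal sketch: checklist answers (yes/no per question) are mapped to 1/0
# and combined with learned weights into one interpretable score.
# Questions and weights below are illustrative placeholders.

CHECKLIST_WEIGHTS = {
    "Is the image well composed?":         0.8,
    "Is the subject free of distortions?": 1.2,
    "Does the image match the prompt?":    1.5,
}

def vision_reward_score(answers: dict[str, bool]) -> float:
    """Combine binary checklist answers into a single weighted score."""
    return sum(w * float(answers[q]) for q, w in CHECKLIST_WEIGHTS.items())

example_answers = {
    "Is the image well composed?": True,
    "Is the subject free of distortions?": False,
    "Does the image match the prompt?": True,
}
print(vision_reward_score(example_answers))  # 0.8 + 0.0 + 1.5 = 2.3

Because each question contributes a known weight to the final score, the result remains interpretable: one can see exactly which dimensions raised or lowered the rating.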

Specifics of Video Evaluation

To meet the challenges of video evaluation, VisionReward systematically analyzes various dynamic characteristics of videos, such as motion stability and quality. As a result, VisionReward achieves significant improvements in predicting video preferences compared to existing methods, such as VideoScore.
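As a rough illustration of what a "dynamic characteristic" can mean in practice, the heuristic below estimates motion stability from frame-to-frame pixel differences. This is an assumed, simplified example for intuition only, not the analysis pipeline used by VisionReward.

import numpy as np

def motion_stability(frames: np.ndarray) -> float:
    """Illustrative heuristic: lower mean frame-to-frame difference and lower
    variance of that difference suggest steadier, less jittery motion.
    `frames` has shape (T, H, W, C) with pixel values in [0, 1]."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    # Penalize both large motion jumps and erratic changes between frames.
    return float(1.0 / (1.0 + diffs.mean() + diffs.std()))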

Multi-Objective Preference Optimization (MPO)

Building upon VisionReward, the MPO algorithm was developed to stably optimize visual generation models and avoid over- or under-optimization of certain factors. MPO utilizes the multi-dimensional evaluations of VisionReward to specifically consider the various aspects of human preferences and achieve a balanced result.
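One plausible way to realize this balancing act, sketched below under the assumption of dimension-wise dominance filtering, is to keep only those training pairs in which the preferred sample is at least as good as the rejected one in every preference dimension and strictly better in at least one. Optimizing on such pairs cannot trade one factor against another. Function and variable names are illustrative, not taken from the paper's code.

def dominates(scores_a: dict[str, float], scores_b: dict[str, float]) -> bool:
    """True if A is at least as good as B in every dimension and strictly
    better in at least one (Pareto dominance over per-dimension scores)."""
    at_least_as_good = all(scores_a[d] >= scores_b[d] for d in scores_b)
    strictly_better = any(scores_a[d] > scores_b[d] for d in scores_b)
    return at_least_as_good and strictly_better

def select_preference_pairs(candidates):
    """Keep only (preferred, rejected) pairs where the preferred sample
    dominates the rejected one across all preference dimensions."""
    pairs = []
    for a, scores_a in candidates:
        for b, scores_b in candidates:
            if a is not b and dominates(scores_a, scores_b):
                pairs.append((a, b))
    return pairs

The selected pairs could then feed a standard preference-optimization objective, so that the generation model improves on all dimensions simultaneously rather than over-fitting to a single one.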

Data Basis and Training

An extensive dataset of human evaluations was created for the training of VisionReward. This dataset includes millions of assessment questions on a variety of images and videos. The data was collected from various sources to ensure high diversity. The training of VisionReward takes place in two steps: First, a vision-language model is trained to answer the assessment questions. Then, the answers are weighted using linear regression to predict human preferences.
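The snippet below gives a rough illustration of the second step: fitting linear weights that map checklist answers to observed human preference judgments. The feature construction (answer differences between two candidates) and the toy data are simplifying assumptions, not the exact regression setup from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

# X: each row is the difference between the binary checklist answers of two
# candidates (answers_A - answers_B); y: 1 if annotators preferred A, else 0.
# Values here are toy placeholders.
X = np.array([[1, 0, 1], [0, -1, 1], [-1, 1, 0], [1, 1, -1]])
y = np.array([1, 1, 0, 1])

# The learned coefficients serve as the per-question weights of the reward.
model = LogisticRegression(fit_intercept=False).fit(X, y)
weights = model.coef_[0]
print(dict(zip(["composition", "fidelity", "alignment"], weights.round(2))))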

Results and Outlook

Initial results show that VisionReward achieves high accuracy in predicting human preferences. Particularly in video evaluation, the model significantly outperforms existing methods. The MPO algorithm enables stable optimization of visual generation models and leads to a better alignment with human preferences. VisionReward and MPO represent a promising approach to further improve the quality and user-friendliness of image and video generation models. The code and datasets are publicly available to promote research and development in this area.

Bibliography

Xu, J., Huang, Y., Cheng, J., Yang, Y., Xu, J., Wang, Y., Duan, W., Yang, S., Jin, Q., Li, S., Teng, J., Yang, Z., Zheng, W., Liu, X., Ding, M., Zhang, X., Gu, X., Huang, S., Huang, M., Tang, J., & Dong, Y. (2024). VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation. arXiv preprint arXiv:2412.21059. https://arxiv.org/abs/2412.21059

Code and datasets: https://github.com/THUDM/VisionReward