Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Although reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring that lacks interpretability and can introduce unexpected biases. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensionally consistent strategy for using VisionReward as the reward model during preference optimization for visual generation. Experiments show that VisionReward significantly outperforms existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models aligned with VisionReward achieve a 31.6% higher pairwise win rate than the same models aligned with VideoScore. All code and datasets are provided at https://github.com/THUDM/VisionReward.
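To make the "hierarchical assessment plus linear weighting" idea concrete, below is a minimal sketch (not the official implementation) of how binary checklist judgments from a vision-language model could be combined by a learned weight vector into a scalar, interpretable reward and used for pairwise comparison. The checklist questions, weights, and `vision_reward` helper are hypothetical placeholders, not artifacts from the VisionReward codebase.

```python
# Minimal sketch of interpretable reward via linear weighting of checklist answers.
# All checklist items and weight values below are illustrative placeholders.

import numpy as np

# Hypothetical fine-grained checklist grouped by dimension (hierarchy flattened here).
CHECKLIST = [
    "Is the image free of visible artifacts?",    # fidelity
    "Are object shapes anatomically plausible?",  # fidelity
    "Does the image match the text prompt?",      # text alignment
    "Is the lighting aesthetically pleasing?",    # aesthetics
]

# Learned linear weights, one per checklist question (placeholder values).
WEIGHTS = np.array([0.9, 1.2, 1.5, 0.6])


def vision_reward(judgments: np.ndarray) -> float:
    """Map binary checklist answers (1 = yes, 0 = no) to a scalar reward."""
    assert judgments.shape == WEIGHTS.shape
    return float(WEIGHTS @ judgments)


# Toy pairwise comparison: the candidate with the higher weighted score is preferred.
answers_a = np.array([1, 1, 1, 0])  # answers a VLM might return for image A
answers_b = np.array([1, 0, 1, 1])  # answers for image B
print("reward A:", vision_reward(answers_a))
print("reward B:", vision_reward(answers_b))
print("preferred:", "A" if vision_reward(answers_a) > vision_reward(answers_b) else "B")
```

Because each checklist answer contributes through an explicit weight, the resulting score can be decomposed question by question, which is the interpretability property the abstract emphasizes.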