Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs focus primarily on the text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address these challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities (text, image, video, audio, and 3D); (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
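To make "training on fixed binary preference pairs" concrete, below is a minimal sketch of the standard Bradley-Terry pairwise loss commonly used to train discriminative RMs. This illustrates the general setup the abstract refers to, not Omni-RewardModel's exact objective; the function name `pairwise_rm_loss` and the toy reward values are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -E[log sigmoid(r_chosen - r_rejected)].

    Each element pairs the RM's scalar reward for the preferred (chosen)
    response with its reward for the dispreferred (rejected) response.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: hypothetical scalar rewards for a batch of four preference pairs.
r_chosen = torch.tensor([1.2, 0.8, 2.0, 0.1])
r_rejected = torch.tensor([0.3, 1.0, 0.5, -0.4])
print(pairwise_rm_loss(r_chosen, r_rejected))  # loss shrinks as chosen outscores rejected
```

Because this objective only ever sees a fixed (chosen, rejected) label per pair, it encodes a single population-level preference ordering; this is the rigidity that free-form preference conditioning is meant to relax.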