Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its assessment in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive experimental results demonstrate its superiority.
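To make the GRPO integration concrete, below is a minimal sketch (not the authors' implementation) of how a pointwise reward model such as UnifiedReward-Flex could supply group-relative advantages to a GRPO-style policy update. The function names `score_with_reward_model` and `grpo_advantages` are hypothetical placeholders, and the stub scoring rule stands in for actually querying the VLM judge, which in the paper's setting would produce a hierarchical assessment before emitting a scalar reward.

```python
# Sketch only: group-relative advantage computation as used in GRPO-style training,
# assuming a pointwise scalar reward from a VLM-based reward model.

from statistics import mean, pstdev
from typing import List


def score_with_reward_model(prompt: str, sample: str) -> float:
    """Hypothetical stub. In practice this would run the reward model,
    which interprets the prompt, grounds on the visual content, builds a
    hierarchical assessment, and returns a scalar score."""
    return float(len(sample)) % 5.0  # placeholder scoring rule


def grpo_advantages(prompt: str, samples: List[str], eps: float = 1e-6) -> List[float]:
    """Compute group-relative advantages: each rollout's reward is normalized
    by the mean and standard deviation of rewards within its own group."""
    rewards = [score_with_reward_model(prompt, s) for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A group of rollouts generated for the same prompt.
    group = ["sample_a", "sample_bb", "sample_ccc", "sample_dddd"]
    print(grpo_advantages("a cat surfing at sunset", group))
```

The design point illustrated here is that GRPO needs only relative quality within a group of rollouts for the same prompt, so a single scalar per sample from the reward model is sufficient to drive the policy update.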