Recent advances in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To address this, inspired by how humans assess visual content, we propose UnifiedReward-Flex, a unified personalized reward model for visual generation that couples reward modeling with flexible, context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its assessment in visual evidence, then dynamically constructs a hierarchical evaluation by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap supervised fine-tuning (SFT), equipping the model with flexible, context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
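For background, the Bradley-Terry-style preference modeling and the DPO objective referenced above are conventionally written as follows; the notation here is the standard formulation from the literature, not necessarily the exact objective used in this work. Under the Bradley-Terry model, a reward model $r_\phi$ scores the preferred sample $y_w$ above the rejected sample $y_l$ for a prompt $x$:

\[
p(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),
\]

and the standard DPO loss used in stage (2) optimizes the policy $\pi_\theta$ directly against a frozen reference $\pi_{\mathrm{ref}}$ on preference pairs:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].
\]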
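To make the GRPO integration concrete, below is a minimal sketch of the group-relative advantage computation at the core of GRPO, assuming the reward model returns one scalar score per generated sample; the function name and the example scores are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style group-relative advantages for one prompt.

    `rewards` holds scalar scores that a reward model (hypothetically,
    UnifiedReward-Flex) assigns to a group of samples generated for the
    same prompt; GRPO normalizes rewards within the group rather than
    training a separate value/critic network.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four images sampled for one prompt, scored by the reward model.
scores = [0.62, 0.85, 0.41, 0.77]  # hypothetical RM outputs
advantages = group_relative_advantages(scores)
print(advantages)  # higher-scored samples receive positive advantages
```

These normalized advantages then weight the policy-gradient update for each sample in the group, so the generator is pushed toward outputs the reward model prefers relative to its own siblings.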