Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its assessment in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive experimental results demonstrate its superiority.
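To make the GRPO integration concrete, below is a minimal sketch (not the authors' implementation) of how a pointwise reward model such as UnifiedReward-Flex could supply group-relative advantages to a GRPO-style policy update. The function names `score_with_reward_model` and `grpo_advantages` are hypothetical placeholders, and the stub scoring rule stands in for actually querying the VLM judge, which in the paper's setting would produce a hierarchical assessment before emitting a scalar reward.

```python
# Sketch only: group-relative advantage computation as used in GRPO-style training,
# assuming a pointwise scalar reward from a VLM-based reward model.

from statistics import mean, pstdev
from typing import List


def score_with_reward_model(prompt: str, sample: str) -> float:
    """Hypothetical stub. In practice this would run the reward model,
    which interprets the prompt, grounds on the visual content, builds a
    hierarchical assessment, and returns a scalar score."""
    return float(len(sample)) % 5.0  # placeholder scoring rule


def grpo_advantages(prompt: str, samples: List[str], eps: float = 1e-6) -> List[float]:
    """Compute group-relative advantages: each rollout's reward is normalized
    by the mean and standard deviation of rewards within its own group."""
    rewards = [score_with_reward_model(prompt, s) for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A group of rollouts generated for the same prompt.
    group = ["sample_a", "sample_bb", "sample_ccc", "sample_dddd"]
    print(grpo_advantages("a cat surfing at sunset", group))
```

The design point illustrated here is that GRPO needs only relative quality within a group of rollouts for the same prompt, so a single scalar per sample from the reward model is sufficient to drive the policy update.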