The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are poorly suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria that can score responses from any policy. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering lowers it from 81.14 to 75.82. In a scalability evaluation on a comprehensive benchmark, rDPO achieves 61.01, well above the style-constrained baseline (52.36) and above the base model (59.48). Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific, criterion-level feedback.
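To make the rubric-based construction of preference pairs concrete, the following is a minimal sketch and not the method's actual implementation. It assumes that a judge has already returned one boolean verdict per rubric criterion, that essential criteria count fully while additional criteria are down-weighted, and that a pair is kept only when the score gap exceeds a margin. The names `rubric_score`, `build_pair`, `additional_weight`, and `min_margin`, as well as the scoring rule itself, are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def rubric_score(
    essential: List[bool],
    additional: List[bool],
    additional_weight: float = 0.5,  # assumed weight, not from the paper
) -> float:
    """Score one response from per-criterion judge verdicts.

    Assumed rule: fraction of essential criteria met, plus a
    down-weighted fraction of additional criteria met.
    """
    ess = sum(essential) / len(essential) if essential else 0.0
    add = sum(additional) / len(additional) if additional else 0.0
    return ess + additional_weight * add


def build_pair(
    responses: List[str],
    scores: List[float],
    min_margin: float = 0.5,  # assumed filtering threshold
) -> Optional[Tuple[str, str]]:
    """Pick (chosen, rejected) from on-policy samples for one instruction.

    Returns None when the samples are too close in rubric score to
    give a clear preference signal.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] < min_margin:
        return None
    return responses[best], responses[worst]


# Two sampled responses to the same image-instruction pair,
# each with hypothetical judge verdicts against the same rubric.
scores = [
    rubric_score([True, True], [True, False]),
    rubric_score([True, False], [False, False]),
]
pair = build_pair(["response A", "response B"], scores)
```

Under these assumptions, the rubric is fixed per instruction and only the sampled responses change, which is what allows an offline instruction-rubric pool to be reused whenever new on-policy responses are drawn.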