Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward that improves verification capability, and a decoupled optimization mechanism that enables synergistic optimization of generation and verification. Specifically, the preference verification reward uses the mean verification scores of positive and negative samples as decision thresholds, providing positive feedback when the predicted correctness aligns with the actual answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate their gradients, and combines the masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and 53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
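A minimal sketch of the two mechanisms described above, under stated assumptions: the function names, the midpoint thresholding choice, and the mask-based broadcasting are illustrative placeholders, not the paper's implementation. It only shows the shape of the idea: a verification reward that pays off when the thresholded self-verification score agrees with actual answer correctness, and group-relative (GRPO-style) advantages computed separately for generation and verification, then routed to their own tokens via masks.

```python
import numpy as np


def preference_verification_reward(scores, correct):
    """Sketch of a preference verification reward (assumed form).

    scores:  self-verification scores for a group of sampled answers
    correct: whether each sampled answer is actually correct

    The decision threshold is taken here as the midpoint between the mean
    score of positive (correct) and negative (incorrect) samples; a sample
    earns reward 1 when its thresholded prediction matches its correctness.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos_mean = scores[correct].mean() if correct.any() else scores.mean()
    neg_mean = scores[~correct].mean() if (~correct).any() else scores.mean()
    threshold = 0.5 * (pos_mean + neg_mean)
    predicted_correct = scores > threshold
    return (predicted_correct == correct).astype(float)


def grpo_advantage(rewards, eps=1e-6):
    """Group-relative advantage as in GRPO: standardize within the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def decoupled_token_advantages(gen_rewards, ver_rewards, gen_mask, ver_mask):
    """Broadcast separate generation/verification advantages onto tokens.

    gen_mask / ver_mask: (num_samples, seq_len) 0/1 arrays marking which
    tokens belong to the answer vs. the self-verification segment, so each
    objective only updates its own tokens.
    """
    adv_gen = grpo_advantage(gen_rewards)[:, None]   # (N, 1)
    adv_ver = grpo_advantage(ver_rewards)[:, None]   # (N, 1)
    return adv_gen * gen_mask + adv_ver * ver_mask   # (N, seq_len)
```

In this sketch, the per-token advantages returned by `decoupled_token_advantages` would weight the policy-gradient term of a standard GRPO objective, so generation rewards never propagate through verification tokens and vice versa.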