Reinforcement learning has become a cornerstone technique for developing reasoning models on complex tasks, ranging from mathematical problem solving to imaginative reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on accurate estimation of an advantage function. However, prevailing methods employ static advantage estimation, which leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation yields suboptimal policy updates, which in turn manifest as slower convergence and greater learning instability, as the model fails to adapt to the evolving utility of its samples. To address this problem, we introduce \textbf{ADORA} (\textbf{A}dvantage \textbf{D}ynamics via \textbf{O}nline \textbf{R}ollout \textbf{A}daptation), a novel framework for policy optimization. ADORA dynamically adjusts the weighting of the advantage function by adaptively categorizing training data into temporarily advantageous and temporarily disadvantageous samples according to their evolving utility during online rollouts. This data-differentiation strategy allows ADORA to be integrated seamlessly into existing policy optimization algorithms without significant architectural modification, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient updates. Extensive evaluations across diverse model families and data scales demonstrate that ADORA is a robust and efficient framework: it substantially enhances long-horizon reasoning on both geometric and mathematical tasks, consistently delivering notable performance gains without sensitive hyperparameter tuning.
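To make the mechanism concrete, the sketch below shows one way such rollout-based advantage reweighting could look in a GRPO-style group-rollout setting. It is a minimal illustration under our own assumptions: the utility proxy (change in a prompt's mean reward across rollout passes), the hard threshold, and the weights \texttt{w\_pos}/\texttt{w\_neg} are hypothetical simplifications, not ADORA's actual formulation.

\begin{verbatim}
# Minimal sketch of dynamic advantage reweighting (NOT the authors'
# implementation). Assumes a GRPO-style setup where each prompt yields
# a group of rollouts; utility_delta, w_pos, w_neg are hypothetical.
import numpy as np

def adaptive_advantages(rewards, prev_mean_reward, w_pos=1.0, w_neg=0.5):
    """Reweight group-normalized advantages by a sample's evolving utility.

    rewards: per-rollout rewards for one prompt at the current step.
    prev_mean_reward: the same prompt's mean reward at the previous
        rollout pass, used here as a crude proxy for whether the sample
        is still informative to the policy.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Standard group-relative advantage, as in GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Utility delta: is the policy still improving on this prompt?
    utility_delta = rewards.mean() - prev_mean_reward
    # Temporarily advantageous samples (still improving) keep full
    # weight; temporarily disadvantageous ones (stagnant or regressing)
    # are down-weighted before the policy-gradient update.
    weight = w_pos if utility_delta > 0 else w_neg
    return weight * adv

# Example: a prompt whose mean reward rose from 0.25 to 0.5 keeps full
# weight, so its rollouts contribute unscaled advantages.
print(adaptive_advantages([1, 0, 1, 0], prev_mean_reward=0.25))
\end{verbatim}

In this reading, the per-prompt weight is recomputed at every online rollout pass, so the same prompt can move between the advantageous and disadvantageous categories as training progresses, which is what lets the update concentrate on currently informative experiences.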