Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While these methods significantly accelerate RL finetuning in terms of training steps, they also incur substantial computational overhead: identifying informative samples requires extensive LLM rollouts over large candidate batches, an expense that can exceed the cost of the finetuning itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which predicts and selects informative prompts online by inferring their learning dynamics before costly rollouts. Specifically, we introduce a new perspective that models each prompt's solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate the evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.
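To make the modeling idea concrete, the following is a minimal sketch of hidden-Markov-model filtering over a prompt's solving progress. All state definitions, matrices, and numbers here are illustrative assumptions, not the paper's actual implementation: we assume three progress states (unsolved, partially solved, solved), a binomial likelihood over rollout successes, and a standard forward-filtering update whose predictive prior scores a prompt's expected informativeness before any new rollouts are spent.

```python
import numpy as np
from math import comb

# Illustrative assumption: three hidden progress states per prompt.
# 0 = unsolved, 1 = partially solved, 2 = solved.
STATES = 3

# Assumed transition matrix: progress tends to persist or improve slowly.
T = np.array([[0.90, 0.09, 0.01],
              [0.05, 0.85, 0.10],
              [0.01, 0.04, 0.95]])

# Assumed per-state probability that a single rollout solves the prompt.
P_SUCCESS = np.array([0.05, 0.50, 0.95])

def filter_step(belief, successes, n_rollouts):
    """One online Bayesian update: predict forward through T, then
    condition on the observed rollout reward signal (number of
    successful rollouts, modeled with a binomial likelihood)."""
    predicted = belief @ T  # predictive prior over next state
    lik = np.array([comb(n_rollouts, successes)
                    * p**successes * (1 - p)**(n_rollouts - successes)
                    for p in P_SUCCESS])
    posterior = predicted * lik
    return posterior / posterior.sum()

def predicted_informativeness(belief):
    """Score a prompt by its predicted mass on the 'partially solved'
    state, i.e. how informative its rollouts are expected to be."""
    return (belief @ T)[1]

# Example: start from an uncertain belief, observe 4/8 successful rollouts.
belief = np.full(STATES, 1.0 / STATES)
belief = filter_step(belief, successes=4, n_rollouts=8)
```

Under this sketch, prompts would be ranked by `predicted_informativeness` and only the top-scoring ones sent to the expensive rollout stage, replacing rollout-intensive filtering with a cheap filtering update per prompt.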