Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach that uses influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of the policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence from pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop \textbf{C}urriculum \textbf{R}L with \textbf{O}ff-\textbf{P}olicy \textbf{I}nfluence guidance (\textbf{CROPI}), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models of up to 7B parameters demonstrate that CROPI significantly accelerates training: on a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10\% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
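As a minimal sketch of the dimensionality-reduction idea mentioned above: per-example gradients can be compressed with a sparse random projection (here an Achlioptas-style sign matrix, one common construction) and influence can then be approximated by inner products in the projected space. The dimensions, the first-order inner-product proxy, and all variable names below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_projection(d, k, s=3):
    """Achlioptas-style sparse random projection matrix of shape (d, k).

    Entries are +sqrt(s/k), 0, -sqrt(s/k) with probabilities
    1/(2s), 1 - 1/s, 1/(2s), so E[P @ P.T] = I_d and most entries are zero,
    which keeps storage and matmul cost low for LLM-sized gradients.
    """
    u = rng.random((d, k))
    P = np.zeros((d, k))
    scale = np.sqrt(s / k)
    P[u < 1.0 / (2 * s)] = scale
    P[u > 1.0 - 1.0 / (2 * s)] = -scale
    return P

d, k = 4096, 64                     # toy sizes: full gradient dim -> sketch dim
P = sparse_projection(d, k)

# Hypothetical per-example gradients and a target (e.g. validation) gradient.
grads = rng.standard_normal((100, d))
target = rng.standard_normal(d)

# First-order influence proxy: the inner product <g_i, g_target>, estimated
# in the k-dim projected space (Johnson-Lindenstrauss preserves it roughly).
scores = (grads @ P) @ (target @ P)
top_indices = np.argsort(scores)[::-1][:10]   # most influential examples
```

In an actual RLVR pipeline the projected gradient sketches, not the full gradients, would be stored per example, so selection at each curriculum stage only needs cheap k-dimensional dot products.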