Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
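The attention-based difficulty estimation described above can be illustrated with a minimal sketch. The idea, as stated, is to run rollouts only for a small reference set of questions and then estimate the adaptive difficulty of any other question from its similarity to that set. The function below is a hypothetical illustration (all names and the specific softmax-weighted formulation are assumptions, not the paper's exact method): it treats similarity scores between a query question's embedding and the reference embeddings as attention logits, and returns a softmax-weighted average of the reference difficulties.

```python
import numpy as np

def estimate_difficulty(query_emb, ref_embs, ref_difficulties, temperature=1.0):
    """Hypothetical attention-style difficulty estimate.

    query_emb:        (d,) embedding of the question to score
    ref_embs:         (k, d) embeddings of the reference questions
    ref_difficulties: (k,) measured difficulties (e.g. rollout failure rates)
    """
    # Similarity logits between the query and each reference question.
    sims = ref_embs @ query_emb / temperature
    # Softmax over the reference set gives the "attention" weights.
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    # Estimated difficulty: weighted average of reference difficulties.
    return float(weights @ ref_difficulties)
```

Under this sketch, a question whose embedding is close to easy reference questions receives a low estimated difficulty, so no rollouts are needed for it; only the k reference questions ever require actual sampling.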