Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data by complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that predicts each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student, i.e., questions that are neither too easy nor too hard (the Goldilocks principle), while the student is trained with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
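The teacher's adaptive selection loop can be illustrated with a minimal sketch. All names and the specific update rule below are assumptions for illustration, not the paper's actual implementation: the teacher keeps a running estimate of the student's solve rate per question, samples questions whose estimated solve rate sits near a target (e.g., 50%), and updates its estimates from the observed GRPO rollout rewards.

```python
import random

class GoldilocksSampler:
    """Hypothetical sketch of Goldilocks-style teacher sampling (assumed API)."""

    def __init__(self, question_ids, target=0.5, band=0.25):
        # Teacher's running estimate of the student's solve rate per question,
        # initialized at the target (maximally uncertain).
        self.est = {q: 0.5 for q in question_ids}
        self.target = target  # ideal difficulty: ~50% solve rate
        self.band = band      # acceptable distance from the target

    def sample(self, k):
        # Prefer questions estimated to be neither too easy nor too hard.
        eligible = [q for q, p in self.est.items()
                    if abs(p - self.target) <= self.band]
        pool = eligible if len(eligible) >= k else list(self.est)
        return random.sample(pool, k)

    def update(self, question, solve_rate, lr=0.3):
        # Adapt the difficulty estimate from the student's observed
        # solve rate on this question (e.g., fraction of correct
        # GRPO rollouts), tracking the student's evolving ability.
        self.est[question] += lr * (solve_rate - self.est[question])
```

In a training loop, one would sample a batch, run GRPO rollouts on it, compute the per-question solve rate from the rollout rewards, and call `update` so that questions the student has mastered drift out of the eligible pool.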