Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization - 专知论文

会员服务 ·

0

正则化 · 样本 · 范例 · 值函数 · 分布偏移 ·

2023 年 3 月 28 日

Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

翻译：离线RL无需OOD动作：通过隐式值正则化的样本内学习

Haoran Xu,Li Jiang,Jianxiong Li,Zhuoran Yang,Zhaoran Wang,Victor Wai Kin Chan,Xianyuan Zhan

from arxiv, ICLR 2023 notable top 5%

Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy as computing $Q$-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.

翻译：大多数离线强化学习方法在改进策略以超越行为策略和约束策略以限制其偏离行为策略之间存在权衡，因为使用分布外动作计算Q值会因分布偏移而导致误差。最近提出的样本内学习范式（如IQL）通过仅使用数据样本进行分位数回归来改进策略，展现出巨大潜力，因为它无需查询任何未见动作的值函数即可学习最优策略。然而，这类方法如何在值函数学习中处理分布偏移尚不清楚。在本工作中，我们有一个关键发现：样本内学习范式是在隐式值正则化框架下产生的。这深入解释了样本内学习范式为何有效，即它对策略施加了隐式值正则化。基于IVR框架，我们进一步提出了两种实用算法：稀疏Q学习和指数Q学习，它们采用了现有工作中相同的值正则化，但以完全样本内方式实现。与IQL相比，我们发现我们的算法在学习值函数时引入了稀疏性，使其在噪声数据场景下更加鲁棒。我们还在D4RL基准数据集上验证了SQL和EQL的有效性，并通过在小数据场景下与CQL的比较展示了样本内学习的优势。

0

相关内容

正则化

在数学，统计学和计算机科学中，尤其是在机器学习和逆问题中，正则化是添加信息以解决不适定问题或防止过度拟合的过程。正则化适用于不适定的优化问题中的目标函数。

【AI+商业投资】法国兴业银行《深度强化学习在投资组合分配中的应用》26页PPT，Deep Reinforcement Learning for portfolio allocation

【AI+商业投资】法国兴业银行《深度强化学习在投资组合分配中的应用》26页PPT，Deep Reinforcement Learning for portfolio allocation

专知会员服务

24+阅读 · 2022年4月1日

【AAAI 2022】一种样本高效的基于模型的保守 actor-critic 算法

【AAAI 2022】一种样本高效的基于模型的保守 actor-critic 算法

专知会员服务

24+阅读 · 2022年1月10日

【NeurIPS 2021】设置多智能体策略梯度的方差

【NeurIPS 2021】设置多智能体策略梯度的方差

专知会员服务

21+阅读 · 2021年10月24日

【ICLR2021】一种基于距离度量学习及行为正则化的完全离线的元强化学习方法

专知会员服务

17+阅读 · 2021年2月9日

NeurIPS 2020最佳论文奖项出炉！GPT-3、伯克利等3篇论文摘得！

NeurIPS 2020最佳论文奖项出炉！GPT-3、伯克利等3篇论文摘得！

专知会员服务

11+阅读 · 2020年12月8日

【AAAI2021】自校正Q学习，Self-correcting Q-Learning

专知会员服务

17+阅读 · 2020年12月4日

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

专知会员服务

54+阅读 · 2020年3月5日

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

专知会员服务

25+阅读 · 2020年2月28日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

浅聊对比学习（Contrastive Learning）第一弹

浅聊对比学习（Contrastive Learning）第一弹

PaperWeekly

1+阅读 · 2022年6月10日

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

专知

16+阅读 · 2020年12月9日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

变步长和变正则化因子的子带自适应滤波算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

复杂数据模型中的分布逼近方法

国家自然科学基金

3+阅读 · 2014年12月31日

空间分数阶Schr？dinger方程的时间分裂谱方法

国家自然科学基金

0+阅读 · 2014年12月31日

广义线性模型的组变量选择及其在信用评分中的应用

国家自然科学基金

2+阅读 · 2014年12月31日

ROC1活化mTOR通路促进膀胱癌侵袭及转移的机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

褐煤O2/CO2燃烧Cr的氧化机理及模型研究

国家自然科学基金

0+阅读 · 2013年12月31日

小GTP酶Rab23通过Rac1调控乳腺癌细胞迁移和侵袭的作用及机制

国家自然科学基金

0+阅读 · 2012年12月31日

基于风险测度的供应链鲁棒建模与策略研究

国家自然科学基金

2+阅读 · 2012年12月31日

lncRNA MALAT1通过SF2/ASF调控癌基因RON剪接促进膀胱癌演进的机制研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于支持向量机的复杂连续系统强化学习控制研究

国家自然科学基金

12+阅读 · 2008年12月31日

RAMP: Hierarchical Reactive Motion Planning for Manipulation Tasks Using Implicit Signed Distance Functions

Arxiv

0+阅读 · 2023年5月17日

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Arxiv

0+阅读 · 2023年5月17日

Reinforcement Learning for Safe Robot Control using Control Lyapunov Barrier Functions

Arxiv

0+阅读 · 2023年5月16日

Combining datasets to increase the number of samples and improve model fitting

Arxiv

0+阅读 · 2023年5月16日

Helping the Helper: Supporting Peer Counselors via AI-Empowered Practice and Feedback

Arxiv

0+阅读 · 2023年5月15日

Causality and Generalizability: Identifiability and Learning Methods

Arxiv

12+阅读 · 2021年10月4日

A continual learning survey: Defying forgetting in classification tasks

Arxiv

32+阅读 · 2021年4月16日

CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning

Arxiv

11+阅读 · 2021年2月18日

Self-correcting Q-Learning

Arxiv

11+阅读 · 2020年12月2日

Meta-Learning with Implicit Gradients

Meta-Learning with Implicit Gradients

Arxiv

13+阅读 · 2019年9月10日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 | CFPO：用反事实策略优化提升多模态推理

ICML 2026 | CFPO：用反事实策略优化提升多模态推理

专知会员服务

2+阅读 · 6月23日

综述 | 世界动作模型：少做梦，多行动

综述 | 世界动作模型：少做梦，多行动

专知会员服务

3+阅读 · 6月23日

美以伊冲突：无人机与人工智能的运用

美以伊冲突：无人机与人工智能的运用

专知会员服务

5+阅读 · 6月23日

《战时图神经网络：整合以色列-伊朗冲突中的网络安全与无人机智能》最新50页文献

《战时图神经网络：整合以色列-伊朗冲突中的网络安全与无人机智能》最新50页文献

专知会员服务

3+阅读 · 6月23日

《特种部队在透明战场中的生存力》最新报告

《特种部队在透明战场中的生存力》最新报告

专知会员服务

4+阅读 · 6月23日

《自主无人机蜂群协同与控制系统：人工智能赋能的战场协同与自主任务编排平台》

《自主无人机蜂群协同与控制系统：人工智能赋能的战场协同与自主任务编排平台》

专知会员服务

4+阅读 · 6月23日

《人工智能生成的零日漏洞：对未来作战的影响》

《人工智能生成的零日漏洞：对未来作战的影响》

专知会员服务

5+阅读 · 6月23日

《理解伙伴国在防务能力选择中的偏好：探索美国解决方案的替代选择》美智库200页报告

《理解伙伴国在防务能力选择中的偏好：探索美国解决方案的替代选择》美智库200页报告

专知会员服务

2+阅读 · 6月23日

ICML 2026 | 边界嵌入塑形：用自适应对比学习破解图结构纠缠

ICML 2026 | 边界嵌入塑形：用自适应对比学习破解图结构纠缠

专知会员服务

6+阅读 · 6月22日

综述 | 3D场景图：开放挑战与未来方向

综述 | 3D场景图：开放挑战与未来方向

专知会员服务

8+阅读 · 6月22日

《国防工业6.0：全自主作战系统、量子-人工智能融合与新一代战略威慑》

《国防工业6.0：全自主作战系统、量子-人工智能融合与新一代战略威慑》

专知会员服务

7+阅读 · 6月22日

21世纪的无人机战争

21世纪的无人机战争

专知会员服务

4+阅读 · 6月22日

《伊朗与以色列-美国热战及其对数字技术的影响》

《伊朗与以色列-美国热战及其对数字技术的影响》

专知会员服务

6+阅读 · 6月22日

《量子技术的军事任务技术适配与利用》

《量子技术的军事任务技术适配与利用》

专知会员服务

5+阅读 · 6月22日

《美国陆军军官学校（西点军校）本科生科研中生成式人工智能的使用》

《美国陆军军官学校（西点军校）本科生科研中生成式人工智能的使用》

专知会员服务

8+阅读 · 6月22日

相关VIP内容

【AI+商业投资】法国兴业银行《深度强化学习在投资组合分配中的应用》26页PPT，Deep Reinforcement Learning for portfolio allocation

【AI+商业投资】法国兴业银行《深度强化学习在投资组合分配中的应用》26页PPT，Deep Reinforcement Learning for portfolio allocation

专知会员服务

24+阅读 · 2022年4月1日

【AAAI 2022】一种样本高效的基于模型的保守 actor-critic 算法

【AAAI 2022】一种样本高效的基于模型的保守 actor-critic 算法

专知会员服务

24+阅读 · 2022年1月10日

【NeurIPS 2021】设置多智能体策略梯度的方差

【NeurIPS 2021】设置多智能体策略梯度的方差

专知会员服务

21+阅读 · 2021年10月24日

【ICLR2021】一种基于距离度量学习及行为正则化的完全离线的元强化学习方法

专知会员服务

17+阅读 · 2021年2月9日

NeurIPS 2020最佳论文奖项出炉！GPT-3、伯克利等3篇论文摘得！

NeurIPS 2020最佳论文奖项出炉！GPT-3、伯克利等3篇论文摘得！

专知会员服务

11+阅读 · 2020年12月8日

【AAAI2021】自校正Q学习，Self-correcting Q-Learning

专知会员服务

17+阅读 · 2020年12月4日

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

专知会员服务

54+阅读 · 2020年3月5日

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

专知会员服务

25+阅读 · 2020年2月28日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

综述 | 世界动作模型：少做梦，多行动

《战时图神经网络：整合以色列-伊朗冲突中的网络安全与无人机智能》最新50页文献

ICML 2026 | CFPO：用反事实策略优化提升多模态推理

美以伊冲突：无人机与人工智能的运用

相关资讯

浅聊对比学习（Contrastive Learning）第一弹

浅聊对比学习（Contrastive Learning）第一弹

PaperWeekly

1+阅读 · 2022年6月10日

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

专知

16+阅读 · 2020年12月9日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

RAMP: Hierarchical Reactive Motion Planning for Manipulation Tasks Using Implicit Signed Distance Functions

Arxiv

0+阅读 · 2023年5月17日

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Arxiv

0+阅读 · 2023年5月17日

Reinforcement Learning for Safe Robot Control using Control Lyapunov Barrier Functions

Arxiv

0+阅读 · 2023年5月16日

Combining datasets to increase the number of samples and improve model fitting

Arxiv

0+阅读 · 2023年5月16日

Helping the Helper: Supporting Peer Counselors via AI-Empowered Practice and Feedback

Arxiv

0+阅读 · 2023年5月15日

Causality and Generalizability: Identifiability and Learning Methods

Arxiv

12+阅读 · 2021年10月4日

A continual learning survey: Defying forgetting in classification tasks

Arxiv

32+阅读 · 2021年4月16日

CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning

Arxiv

11+阅读 · 2021年2月18日

Self-correcting Q-Learning

Arxiv

11+阅读 · 2020年12月2日

Meta-Learning with Implicit Gradients

Meta-Learning with Implicit Gradients

Arxiv

13+阅读 · 2019年9月10日

相关基金

变步长和变正则化因子的子带自适应滤波算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

复杂数据模型中的分布逼近方法

国家自然科学基金

3+阅读 · 2014年12月31日

空间分数阶Schr？dinger方程的时间分裂谱方法

国家自然科学基金

0+阅读 · 2014年12月31日

广义线性模型的组变量选择及其在信用评分中的应用

国家自然科学基金

2+阅读 · 2014年12月31日

ROC1活化mTOR通路促进膀胱癌侵袭及转移的机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

褐煤O2/CO2燃烧Cr的氧化机理及模型研究

国家自然科学基金

0+阅读 · 2013年12月31日

小GTP酶Rab23通过Rac1调控乳腺癌细胞迁移和侵袭的作用及机制

国家自然科学基金

0+阅读 · 2012年12月31日

基于风险测度的供应链鲁棒建模与策略研究

国家自然科学基金

2+阅读 · 2012年12月31日

lncRNA MALAT1通过SF2/ASF调控癌基因RON剪接促进膀胱癌演进的机制研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于支持向量机的复杂连续系统强化学习控制研究

国家自然科学基金

12+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员