Mind the Gap: Offline Policy Optimization for Imperfect Rewards - 专知论文

会员服务 ·

0

优化器 · Performer · 层 · 多样性 · Learning ·

2023 年 2 月 3 日

Mind the Gap: Offline Policy Optimization for Imperfect Rewards

翻译：关注差距：面向不完美奖励的离线策略优化

Jianxiong Li,Xiao Hu,Haoran Xu,Jingjing Liu,Xianyuan Zhan,Qing-Shan Jia,Ya-Qin Zhang

from arxiv, Accept by ICLR2023. The first two authors contributed equally

Reward function is essential in reinforcement learning (RL), serving as the guiding signal to incentivize agents to solve given tasks, however, is also notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss for RL agents. In this study, we propose a unified offline policy optimization approach, \textit{RGM (Reward Gap Minimization)}, which can smartly handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sampled-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards.

翻译：奖励函数在强化学习中至关重要，它作为引导信号激励智能体完成给定任务，但同时也是出了名的难以设计。在许多情况下，只能获得不完美的奖励，这会导致强化学习智能体性能显著下降。在本研究中，我们提出了一种统一的离线策略优化方法——RGM（奖励差距最小化），它能够智能地处理多种类型的不完美奖励。RGM被形式化为一个双层优化问题：上层优化一个奖励修正项，使其与某些专家数据的访问分布匹配；下层则利用修正后的奖励求解一个悲观强化学习问题。通过利用下层的对偶性，我们推导出一个可行的算法，使得无需任何在线交互即可进行基于样本的学习。综合实验表明，在多种不完美奖励设置下，RGM取得了优于现有方法的性能。此外，RGM能够有效纠正偏离专家偏好的错误或不一致奖励，并从有偏奖励中提取有用信息。

0

相关内容

优化器

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

46+阅读 · 2020年10月31日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

84+阅读 · 2019年10月9日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

LibRec 精选：推荐系统的论文与源码

LibRec 精选：推荐系统的论文与源码

LibRec智能推荐

14+阅读 · 2018年11月29日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

磁流体动力学方程的有限元法及多重网格算法

国家自然科学基金

0+阅读 · 2015年12月31日

多陶瓷相强化Fe基纳米复合材料的激光选区熔化原位合成机理研究

国家自然科学基金

0+阅读 · 2015年12月31日

线粒体TRAP1分子介导Ago2蛋白表达在肠癌转移中的作用机制

国家自然科学基金

0+阅读 · 2014年12月31日

各向异性SmCo5/Fe(Co)纳米复合永磁材料的研究

国家自然科学基金

0+阅读 · 2014年12月31日

Kahler 曲面中特殊曲面的研究

国家自然科学基金

0+阅读 · 2014年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

GOCE引力梯度数据的时间序列分析与误差处理

国家自然科学基金

0+阅读 · 2013年12月31日

非光滑系统动力学中奇异吸引子的研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于list-mode数据的快速SART真3D PET断层重建算法的研究

国家自然科学基金

0+阅读 · 2011年12月31日

p进表示的伽罗瓦上同调

国家自然科学基金

0+阅读 · 2008年12月31日

Optimal Transport for Offline Imitation Learning

Arxiv

0+阅读 · 2023年3月24日

Remind of the Past: Incremental Learning with Analogical Prompts

Arxiv

0+阅读 · 2023年3月24日

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

Arxiv

0+阅读 · 2023年3月23日

On Designing a Learning Robot: Improving Morphology for Enhanced Task Performance and Learning

Arxiv

0+阅读 · 2023年3月23日

Split-Et-Impera: A Framework for the Design of Distributed Deep Learning Applications

Arxiv

0+阅读 · 2023年3月22日

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

Arxiv

0+阅读 · 2023年3月22日

Deep RL with Hierarchical Action Exploration for Dialogue Generation

Arxiv

0+阅读 · 2023年3月22日

Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL

Arxiv

0+阅读 · 2023年3月21日

A Simple Framework for Contrastive Learning of Visual Representations

Arxiv

21+阅读 · 2020年2月13日

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

Arxiv

12+阅读 · 2019年3月8日

VIP会员

文章信息

相关主题

最新内容

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

专知会员服务

7+阅读 · 6月20日

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

专知会员服务

4+阅读 · 6月19日

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

专知会员服务

6+阅读 · 6月19日

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

6+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

7+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

11+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

11+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

7+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

11+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

8+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

18+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

9+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

6+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

8+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

8+阅读 · 6月17日

相关VIP内容

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

46+阅读 · 2020年10月31日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

84+阅读 · 2019年10月9日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

相关资讯

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

LibRec 精选：推荐系统的论文与源码

LibRec 精选：推荐系统的论文与源码

LibRec智能推荐

14+阅读 · 2018年11月29日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

相关论文

Optimal Transport for Offline Imitation Learning

Arxiv

0+阅读 · 2023年3月24日

Remind of the Past: Incremental Learning with Analogical Prompts

Arxiv

0+阅读 · 2023年3月24日

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

Arxiv

0+阅读 · 2023年3月23日

On Designing a Learning Robot: Improving Morphology for Enhanced Task Performance and Learning

Arxiv

0+阅读 · 2023年3月23日

Split-Et-Impera: A Framework for the Design of Distributed Deep Learning Applications

Arxiv

0+阅读 · 2023年3月22日

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

Arxiv

0+阅读 · 2023年3月22日

Deep RL with Hierarchical Action Exploration for Dialogue Generation

Arxiv

0+阅读 · 2023年3月22日

Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL

Arxiv

0+阅读 · 2023年3月21日

A Simple Framework for Contrastive Learning of Visual Representations

Arxiv

21+阅读 · 2020年2月13日

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

Arxiv

12+阅读 · 2019年3月8日

相关基金

磁流体动力学方程的有限元法及多重网格算法

国家自然科学基金

0+阅读 · 2015年12月31日

多陶瓷相强化Fe基纳米复合材料的激光选区熔化原位合成机理研究

国家自然科学基金

0+阅读 · 2015年12月31日

线粒体TRAP1分子介导Ago2蛋白表达在肠癌转移中的作用机制

国家自然科学基金

0+阅读 · 2014年12月31日

各向异性SmCo5/Fe(Co)纳米复合永磁材料的研究

国家自然科学基金

0+阅读 · 2014年12月31日

Kahler 曲面中特殊曲面的研究

国家自然科学基金

0+阅读 · 2014年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

GOCE引力梯度数据的时间序列分析与误差处理

国家自然科学基金

0+阅读 · 2013年12月31日

非光滑系统动力学中奇异吸引子的研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于list-mode数据的快速SART真3D PET断层重建算法的研究

国家自然科学基金

0+阅读 · 2011年12月31日

p进表示的伽罗瓦上同调

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员