Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes - 专知论文

会员服务 ·

0

估计/估计量 · Markov · 部分可观测马尔可夫决策过程 · Learning · Processing（编程语言） ·

2023 年 3 月 22 日

Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

翻译：近端强化学习：部分可观测马尔可夫决策过程中的高效离线策略评估

Andrew Bennett,Nathan Kallus

In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP given trajectories with only partial state observations generated by a different and unknown policy that may depend on the unobserved state. We tackle two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how to best estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functions. We then show how to construct semiparametrically efficient estimators in these settings. We term the resulting framework proximal reinforcement learning (PRL). We demonstrate the benefits of PRL in an extensive simulation study and on the problem of sepsis management.

翻译：在离线强化学习应用于观测数据（例如医疗或教育领域）时，一个普遍担忧是观测到的动作可能受未观测因素影响，从而引发混淆偏差，并导致在完美马尔可夫决策过程（MDP）模型假设下所得到的估计结果出现偏差。本文通过考虑部分可观测MDP（POMDP）中的离线策略评估来应对这一问题。具体而言，我们考虑在POMDP中估计给定目标策略的价值，其中轨迹仅包含由另一未知策略（可能依赖于未观测状态）生成的局部状态观测。我们解决两个问题：哪些条件能够使我们从观测数据中识别出目标策略价值，以及在识别成立时如何最优地估计该价值。为回答这些问题，我们将近端因果推断框架扩展到POMDP设定中，提供了多种通过所谓桥函数的存在性实现识别的场景。随后，我们展示了如何在这些场景中构建半参数有效估计量。我们将所提出的框架称为近端强化学习（PRL）。我们通过广泛的仿真研究以及败血症管理问题，展示了PRL的优势。

0

相关内容

估计/估计量

估计/估计量

斯坦福最新《强化学习》2023课程，Emma Brunskill主讲，附PPT下载

斯坦福最新《强化学习》2023课程，Emma Brunskill主讲，附PPT下载

专知会员服务

46+阅读 · 2023年1月17日

【MIla】一种意识启发规划的基于模型强化学习，A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

【MIla】一种意识启发规划的基于模型强化学习，A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

专知会员服务

24+阅读 · 2022年3月19日

【干货书】强化学习算法，98页pdf综合讲解人工智能和机器学习

【干货书】强化学习算法，98页pdf综合讲解人工智能和机器学习

专知会员服务

68+阅读 · 2021年2月21日

【DeepMind】强化学习教程，83页ppt

【DeepMind】强化学习教程，83页ppt

专知会员服务

160+阅读 · 2020年8月7日

神经常微分方程教程，50页ppt，A brief tutorial on Neural ODEs

神经常微分方程教程，50页ppt，A brief tutorial on Neural ODEs

专知会员服务

74+阅读 · 2020年8月2日

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

专知会员服务

86+阅读 · 2020年2月18日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

【2020密歇根大学论文】基于学习的序列决策算法的公平性综述论文，Fairness in Learning-Based Sequential Decision Algorithms: A Survey

【2020密歇根大学论文】基于学习的序列决策算法的公平性综述论文，Fairness in Learning-Based Sequential Decision Algorithms: A Survey

专知会员服务

23+阅读 · 2020年1月15日

【Facebook|AAAI2020】在合作的部分可观察博弈中通过搜索改进策略（Improving Policies via Search in Cooperative Partially Observable Games）

【Facebook|AAAI2020】在合作的部分可观察博弈中通过搜索改进策略（Improving Policies via Search in Cooperative Partially Observable Games）

专知会员服务

16+阅读 · 2019年12月10日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

专知

16+阅读 · 2020年12月9日

量化金融强化学习论文集合

量化金融强化学习论文集合

专知

14+阅读 · 2019年12月18日

强化学习扫盲贴：从Q-learning到DQN

强化学习扫盲贴：从Q-learning到DQN

夕小瑶的卖萌屋

52+阅读 · 2019年10月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

强化学习初探 - 从多臂老虎机问题说起

强化学习初探 - 从多臂老虎机问题说起

专知

10+阅读 · 2018年4月3日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

基于重要性采样的并行离策略强化学习方法研究

国家自然科学基金

24+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

47+阅读 · 2015年12月31日

考虑总量折扣的运输服务采购问题研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于网络拓扑结构的随机多自主体系统分布式控制

国家自然科学基金

0+阅读 · 2013年12月31日

动态多摄像头环境中拥挤多目标跟踪的联合建模与协同优化

国家自然科学基金

0+阅读 · 2013年12月31日

基于数据挖掘近似函数的智能制造装备生产决策参数稳健优化方法研究

国家自然科学基金

1+阅读 · 2012年12月31日

多元函数的稀疏逼近与随机逼近

国家自然科学基金

1+阅读 · 2012年12月31日

基于贝叶斯推理的模糊逻辑强化学习模型研究

国家自然科学基金

18+阅读 · 2012年12月31日

数值求解最优控制：动态规划方法

国家自然科学基金

1+阅读 · 2009年12月31日

随机微分方程的逼近

国家自然科学基金

0+阅读 · 2009年12月31日

More Like Real World Game Challenge for Partially Observable Multi-Agent Cooperation

Arxiv

0+阅读 · 2023年5月15日

Private and Communication-Efficient Algorithms for Entropy Estimation

Arxiv

0+阅读 · 2023年5月12日

Differentially Private Set-Based Estimation Using Zonotopes

Arxiv

0+阅读 · 2023年5月12日

Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding

Arxiv

0+阅读 · 2023年5月11日

HoneyIoT: Adaptive High-Interaction Honeypot for IoT Devices Through Reinforcement Learning

Arxiv

0+阅读 · 2023年5月10日

Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization

Arxiv

0+阅读 · 2023年5月10日

Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Arxiv

0+阅读 · 2023年5月10日

Probabilistic Guarantees for Safe Reinforcement Learning in Continuous Action Spaces via Temporal Logic

Arxiv

0+阅读 · 2023年5月10日

Recent Advances in Reinforcement Learning in Finance

Arxiv

11+阅读 · 2021年12月8日

A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions

A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions

Arxiv

80+阅读 · 2020年1月19日

VIP会员

文章信息

相关主题

估计/估计量

部分可观测马尔可夫决策过程

Processing（编程语言）

最新内容

《无人机对海面作战影响评估》

《无人机对海面作战影响评估》

专知会员服务

7+阅读 · 7月21日

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

专知会员服务

8+阅读 · 7月21日

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

专知会员服务

2+阅读 · 7月21日

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

专知会员服务

3+阅读 · 7月21日

五角大楼新设无人机办公室（DRPM-UxS）将如何重塑美国无人系统格局（附美国防部设立备忘录）

五角大楼新设无人机办公室（DRPM-UxS）将如何重塑美国无人系统格局（附美国防部设立备忘录）

专知会员服务

5+阅读 · 7月21日

印度精确打击与指挥架构的断层

印度精确打击与指挥架构的断层

专知会员服务

5+阅读 · 7月20日

《NASA喷气推进实验室：高耐久轻质常驻空观测系统（HELIOS）》429页

《NASA喷气推进实验室：高耐久轻质常驻空观测系统（HELIOS）》429页

专知会员服务

7+阅读 · 7月20日

美空军AI完成F-16战斗机自主空战历史性试飞

美空军AI完成F-16战斗机自主空战历史性试飞

专知会员服务

6+阅读 · 7月20日

《美政府问责局——武器系统年度评估（2026年）：强制要求成熟技术或可推动转向快速交付》249页

《美政府问责局——武器系统年度评估（2026年）：强制要求成熟技术或可推动转向快速交付》249页

专知会员服务

8+阅读 · 7月20日

《美国陆军：通过弹性分布式模型库实现自适应AI优势》

《美国陆军：通过弹性分布式模型库实现自适应AI优势》

专知会员服务

6+阅读 · 7月20日

博士论文 | 理解与改进大语言模型推理：从反转诅咒到连续思维链

博士论文 | 理解与改进大语言模型推理：从反转诅咒到连续思维链

专知会员服务

8+阅读 · 7月20日

综述 | 终身视觉表征：持续自监督学习CSSL系统综述

综述 | 终身视觉表征：持续自监督学习CSSL系统综述

专知会员服务

8+阅读 · 7月20日

深入Project Maven：为何人工智能在战场上依然失灵

深入Project Maven：为何人工智能在战场上依然失灵

专知会员服务

15+阅读 · 7月19日

锻造未来士兵：外骨骼、基因工程与赛博格

锻造未来士兵：外骨骼、基因工程与赛博格

专知会员服务

7+阅读 · 7月19日

《无人机系统（UAS）通信网状网络试验性部署》50页报告

《无人机系统（UAS）通信网状网络试验性部署》50页报告

专知会员服务

10+阅读 · 7月19日

相关VIP内容

斯坦福最新《强化学习》2023课程，Emma Brunskill主讲，附PPT下载

斯坦福最新《强化学习》2023课程，Emma Brunskill主讲，附PPT下载

专知会员服务

46+阅读 · 2023年1月17日

【MIla】一种意识启发规划的基于模型强化学习，A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

【MIla】一种意识启发规划的基于模型强化学习，A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

专知会员服务

24+阅读 · 2022年3月19日

【干货书】强化学习算法，98页pdf综合讲解人工智能和机器学习

【干货书】强化学习算法，98页pdf综合讲解人工智能和机器学习

专知会员服务

68+阅读 · 2021年2月21日

【DeepMind】强化学习教程，83页ppt

【DeepMind】强化学习教程，83页ppt

专知会员服务

160+阅读 · 2020年8月7日

神经常微分方程教程，50页ppt，A brief tutorial on Neural ODEs

神经常微分方程教程，50页ppt，A brief tutorial on Neural ODEs

专知会员服务

74+阅读 · 2020年8月2日

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

专知会员服务

86+阅读 · 2020年2月18日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

【2020密歇根大学论文】基于学习的序列决策算法的公平性综述论文，Fairness in Learning-Based Sequential Decision Algorithms: A Survey

【2020密歇根大学论文】基于学习的序列决策算法的公平性综述论文，Fairness in Learning-Based Sequential Decision Algorithms: A Survey

专知会员服务

23+阅读 · 2020年1月15日

【Facebook|AAAI2020】在合作的部分可观察博弈中通过搜索改进策略（Improving Policies via Search in Cooperative Partially Observable Games）

【Facebook|AAAI2020】在合作的部分可观察博弈中通过搜索改进策略（Improving Policies via Search in Cooperative Partially Observable Games）

专知会员服务

16+阅读 · 2019年12月10日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

热门VIP内容

开通专知VIP会员享更多权益服务

《可损耗无人系统规模化应用对美国军事转型的战略影响（2022-2030）》2026年270页

综述 | 面向5G/6G网络的LLM智能体AI：架构、协议与标准化

《无人机对海面作战影响评估》

博士论文 | 后训练如何损害大模型生成多样性？SimpleStrat与Stylus

相关资讯

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

【NeurIPS 2020 Tutorial】离线强化学习:从算法到挑战，80页ppt

专知

16+阅读 · 2020年12月9日

量化金融强化学习论文集合

量化金融强化学习论文集合

专知

14+阅读 · 2019年12月18日

强化学习扫盲贴：从Q-learning到DQN

强化学习扫盲贴：从Q-learning到DQN

夕小瑶的卖萌屋

52+阅读 · 2019年10月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

强化学习初探 - 从多臂老虎机问题说起

强化学习初探 - 从多臂老虎机问题说起

专知

10+阅读 · 2018年4月3日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

More Like Real World Game Challenge for Partially Observable Multi-Agent Cooperation

Arxiv

0+阅读 · 2023年5月15日

Private and Communication-Efficient Algorithms for Entropy Estimation

Arxiv

0+阅读 · 2023年5月12日

Differentially Private Set-Based Estimation Using Zonotopes

Arxiv

0+阅读 · 2023年5月12日

Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding

Arxiv

0+阅读 · 2023年5月11日

HoneyIoT: Adaptive High-Interaction Honeypot for IoT Devices Through Reinforcement Learning

Arxiv

0+阅读 · 2023年5月10日

Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization

Arxiv

0+阅读 · 2023年5月10日

Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Arxiv

0+阅读 · 2023年5月10日

Probabilistic Guarantees for Safe Reinforcement Learning in Continuous Action Spaces via Temporal Logic

Arxiv

0+阅读 · 2023年5月10日

Recent Advances in Reinforcement Learning in Finance

Arxiv

11+阅读 · 2021年12月8日

A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions

A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions

Arxiv

80+阅读 · 2020年1月19日

相关基金

基于重要性采样的并行离策略强化学习方法研究

国家自然科学基金

24+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

47+阅读 · 2015年12月31日

考虑总量折扣的运输服务采购问题研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于网络拓扑结构的随机多自主体系统分布式控制

国家自然科学基金

0+阅读 · 2013年12月31日

动态多摄像头环境中拥挤多目标跟踪的联合建模与协同优化

国家自然科学基金

0+阅读 · 2013年12月31日

基于数据挖掘近似函数的智能制造装备生产决策参数稳健优化方法研究

国家自然科学基金

1+阅读 · 2012年12月31日

多元函数的稀疏逼近与随机逼近

国家自然科学基金

1+阅读 · 2012年12月31日

基于贝叶斯推理的模糊逻辑强化学习模型研究

国家自然科学基金

18+阅读 · 2012年12月31日

数值求解最优控制：动态规划方法

国家自然科学基金

1+阅读 · 2009年12月31日

随机微分方程的逼近

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员