Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is complicated by the need to estimate returns up to a fixed terminal time. Existing infinite-horizon methods, which typically rely on discounted contraction, do not naturally accommodate this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon return, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: an action is selected only when its estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective and prove fast finite-sample convergence: it attains minimax-optimal constant regret for $K=1$ and $\mathcal{O}(\max(K-1, C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate our algorithm on the objective of maximizing cumulative reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods on synthetic MDPs and the JumpRiverswim, FrozenLake, and AnyTrading environments.
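To make the two ingredients of the abstract concrete, the following is a minimal illustrative sketch of a K-step lookahead Q-function with thresholded action selection on a toy tabular MDP. The MDP, the function names, and the threshold schedule here are hypothetical examples for intuition only, not the paper's algorithm (which learns the model online and carries the regret guarantees stated above); this sketch assumes a known model and computes the K-step lookahead values by finite-horizon backward induction.

```python
# Hypothetical toy MDP (not from the paper): 3 states, 2 actions.
S, A = 3, 2
# Deterministic transitions: action 0 stays put; action 1 moves to (s + 1) % S.
P = [[[1.0 if s2 == s else 0.0 for s2 in range(S)],
      [1.0 if s2 == (s + 1) % S else 0.0 for s2 in range(S)]]
     for s in range(S)]
# Reward 1 only for taking action 1 in state 2.
R = [[0.0, 1.0 if s == 2 else 0.0] for s in range(S)]

def k_step_lookahead_q(P, R, K):
    """K-step lookahead Q-values by backward induction (K >= 1):
    V_0 = 0,  Q_k(s,a) = R(s,a) + sum_{s'} P(s'|s,a) V_{k-1}(s'),
    V_k(s) = max_a Q_k(s,a).  Returns Q_K as a nested list."""
    V = [0.0] * S
    for _ in range(K):
        Q = [[R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)]
             for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return Q

def select_action(Q, s, t, threshold_fn):
    """Thresholded selection: act greedily only if the best K-step
    lookahead value clears a time-varying threshold; else abstain (None)."""
    best = max(range(A), key=lambda a: Q[s][a])
    return best if Q[s][best] >= threshold_fn(t) else None
```

With K = 2, state 1 can reach the rewarding transition within the lookahead window, so `select_action` fires there but abstains in state 0, where no reward is visible in two steps. The online algorithm in the paper replaces the known `P` and `R` with estimates refined from data and, in the implementation, grows K over time.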