Near-continuous time Reinforcement Learning for continuous state-action spaces

We consider the Reinforcement Learning problem of controlling an unknown dynamical system to maximise the long-term average reward along a single trajectory. Most of the literature considers system interactions that occur in discrete time and discrete state-action spaces. Although this standpoint is suitable for games, it is often inadequate for mechanical or digital systems in which interactions occur at a high frequency, if not in continuous time, and whose state spaces are large if not inherently continuous. Perhaps the only exception is the Linear Quadratic framework for which results exist both in discrete and continuous time. However, its ability to handle continuous states comes with the drawback of a rigid dynamic and reward structure. This work aims to overcome these shortcomings by modelling interaction times with a Poisson clock of frequency $\varepsilon^{-1}$, which captures arbitrary time scales: from discrete ($\varepsilon=1$) to continuous time ($\varepsilon\downarrow0$). In addition, we consider a generic reward function and model the state dynamics according to a jump process with an arbitrary transition kernel on $\mathbb{R}^d$. We show that the celebrated optimism protocol applies when the sub-tasks (learning and planning) can be performed effectively. We tackle learning within the eluder dimension framework and propose an approximate planning method based on a diffusive limit approximation of the jump process. Overall, our algorithm enjoys a regret of order $\tilde{\mathcal{O}}(\varepsilon^{1/2} T+\sqrt{T})$. As the frequency of interactions blows up, the approximation error $\varepsilon^{1/2} T$ vanishes, showing that $\tilde{\mathcal{O}}(\sqrt{T})$ is attainable in near-continuous time.
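To make the time-scale parameter concrete, the following is a minimal sketch, not the authors' algorithm. It samples the interaction times of a Poisson clock of frequency $\varepsilon^{-1}$ and evaluates the two leading terms of the stated regret order, with constants and logarithmic factors dropped. The function names `poisson_clock_times` and `regret_rate`, the seed, and the example values are assumptions chosen for illustration.

```python
import random


def poisson_clock_times(eps, horizon, seed=0):
    """Sample interaction times of a Poisson clock of frequency 1/eps on [0, horizon].

    Illustrative helper (hypothetical name): inter-arrival times are drawn as
    independent exponentials with mean eps, i.e. rate 1/eps.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(1.0 / eps)  # expovariate takes the rate, so the mean is eps
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(1.0 / eps)
    return times


def regret_rate(eps, horizon):
    """Leading terms eps^(1/2) * T + sqrt(T) of the regret order.

    Constants and log factors hidden by the tilde-O notation are dropped, so
    the value only indicates how the two terms scale.
    """
    return eps ** 0.5 * horizon + horizon ** 0.5


if __name__ == "__main__":
    T = 100.0
    for eps in (1.0, 0.1, 0.01):
        n = len(poisson_clock_times(eps, T))
        print(f"eps={eps}: {n} interactions, rate ~ {regret_rate(eps, T):.1f}")
```

On a horizon $T$ the clock rings about $T/\varepsilon$ times on average, so shrinking $\varepsilon$ increases the number of interactions while the $\varepsilon^{1/2} T$ term of the bound shrinks toward zero, leaving the $\sqrt{T}$ term.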

