q-Learning in Continuous Time

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term "(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a "q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
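To make "the Gibbs measure generated from the q-function" concrete, here is a minimal sketch, assuming the Gibbs density is proportional to exp(q / temperature) and that the action space is a one-dimensional grid. The names `gibbs_density`, `toy_q`, and `temperature`, as well as the quadratic form of the q-function, are illustrative assumptions and do not come from the abstract.

```python
import numpy as np

def gibbs_density(q_values, actions, temperature):
    """Density proportional to exp(q / temperature) on a 1-D action grid."""
    logits = q_values / temperature
    logits = logits - logits.max()  # subtract the max to keep exp() stable
    weights = np.exp(logits)
    # Trapezoidal rule for the normalising constant over the grid.
    steps = np.diff(actions)
    normaliser = np.sum(0.5 * (weights[1:] + weights[:-1]) * steps)
    return weights / normaliser

def toy_q(x, a):
    # Hypothetical q-function: quadratic in the action, peaked at a = 0.3 * x.
    return -0.5 * (a - 0.3 * x) ** 2

actions = np.linspace(-5.0, 5.0, 2001)
density = gibbs_density(toy_q(1.0, actions), actions, temperature=0.5)
```

With a quadratic q-function the resulting density is Gaussian, so the normalising constant is available in closed form; this is the kind of case in which the density "can be computed explicitly". For a general q-function the normalising integral may have no closed form, which is the other case the abstract distinguishes.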

