Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization

The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although RL algorithms are primarily designed for discrete environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, which often leads to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that CT-DDPG offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods across a wide range of control tasks with varying time discretizations and noise levels.
