The theory of continuous-time reinforcement learning (RL) has progressed rapidly in recent years. While the ultimate objective of RL is typically to learn deterministic control policies, most existing continuous-time RL methods rely on stochastic policies. Such approaches often require sampling actions at very high frequencies and involve computationally expensive expectations over continuous action spaces, resulting in high-variance gradient estimates and slow convergence. In this paper, we introduce and develop deterministic policy gradient (DPG) methods for continuous-time RL. We derive a continuous-time policy gradient formula expressed as the expected gradient of an advantage rate function and establish a martingale characterization for both the value function and the advantage rate. These theoretical results provide tractable estimators for deterministic policy gradients in continuous-time RL. Building on this foundation, we propose a model-free continuous-time Deep Deterministic Policy Gradient (CT-DDPG) algorithm that enables stable learning for general RL problems with continuous time and state spaces. Numerical experiments show that CT-DDPG achieves superior stability and faster convergence compared to existing stochastic-policy methods across a wide range of learning tasks with varying time discretizations and noise levels.
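For orientation, the discrete-time DPG theorem (Silver et al., 2014) writes the policy gradient as $\nabla_\theta J(\theta) = \mathbb{E}\bigl[\nabla_\theta \mu_\theta(s)\,\nabla_a Q^{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\bigr]$. A continuous-time analogue expressed through an advantage rate function might schematically take the form below; this is a hedged sketch under standard notation (discount rate $\beta$, horizon $T$, controlled state process $X_t$ are assumptions), not the paper's exact statement.
\[
  \nabla_\theta J(\theta)
  \;=\;
  \mathbb{E}\!\left[
    \int_0^T e^{-\beta t}\,
    \nabla_\theta \pi_\theta(X_t)^{\!\top}\,
    \nabla_a A^{\pi_\theta}\!\bigl(t, X_t, a\bigr)\Big|_{a=\pi_\theta(X_t)}
    \,\mathrm{d}t
  \right],
\]
where $\pi_\theta$ is the deterministic policy and $A^{\pi_\theta}$ denotes its advantage rate function; a martingale characterization of the value function and of $A^{\pi_\theta}$ is what makes the inner quantity estimable from trajectory data without a model.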