We propose a comprehensive framework for policy gradient methods tailored to continuous time reinforcement learning. This is based on the connection between stochastic control problems and randomised problems, enabling applications across various classes of Markovian continuous time control problems beyond diffusion models, including, e.g., regular, impulse and optimal stopping/switching problems. By utilising a change of measure in the control randomisation technique, we derive a new policy gradient representation for these randomised problems, featuring parametrised intensity policies. We further develop actor-critic algorithms specifically designed to address general Markovian stochastic control problems. Our framework is demonstrated through its application to optimal switching problems, with two numerical case studies in the energy sector focusing on real options.