We study continuous-time reinforcement learning (RL) for stochastic control in which the system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration--exploitation balance essential to RL. Unlike in the pure-diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and $q$-learning algorithms as in Jia and Zhou (2022a, 2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought, in general, to affect the parameterizations of actors and critics. As an application, we investigate the mean--variance portfolio selection problem with the stock price modelled as a jump-diffusion, and show that both the RL algorithms and the parameterizations are invariant with respect to jumps in this setting. Finally, we present a detailed study of applying the general theory to option hedging.
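To fix ideas, the controlled dynamics referred to above are of the jump-diffusion type sketched below; the coefficients $b$, $\sigma$, $\eta$, the Brownian motion $W$, and the compensated Poisson random measure $\widetilde{N}$ are generic placeholders rather than the paper's exact notation:
\[
\mathrm{d}X_t \;=\; b(t, X_t, a_t)\,\mathrm{d}t \;+\; \sigma(t, X_t, a_t)\,\mathrm{d}W_t \;+\; \int_{\mathbb{R}} \eta(t, X_{t-}, a_{t-}, z)\,\widetilde{N}(\mathrm{d}t, \mathrm{d}z),
\]
where the last integral term is absent in the pure-diffusion case of Wang et al. (2020).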
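Likewise, the entropy-regularized exploratory objective takes, in the standard form of Wang et al. (2020) and Jia and Zhou (2022a), roughly the shape below; this is a sketch with a generic running reward $r$, terminal reward $h$, and temperature parameter $\gamma > 0$, where $\pi$ denotes a stochastic (relaxed) policy and the jump part enters only through the dynamics of $X^{\pi}$:
\[
\sup_{\pi}\; \mathbb{E}\!\left[\int_0^T \Big( r\big(t, X^{\pi}_t, a^{\pi}_t\big) \;-\; \gamma \log \pi\big(a^{\pi}_t \,\big|\, t, X^{\pi}_t\big) \Big)\,\mathrm{d}t \;+\; h\big(X^{\pi}_T\big)\right],
\]
where the $-\gamma \log \pi$ term rewards the differential entropy of the policy and thereby encourages exploration.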