This paper studies optimal stopping problems in continuous time and continuous state space from a reinforcement learning perspective. We begin by formulating the stopping problem using randomized stopping times, where the decision maker's control is the probability of having stopped by a given time; specifically, a bounded, non-decreasing, càdlàg control process. To encourage exploration and facilitate learning, we introduce a regularized version of the problem by penalizing the performance criterion with the cumulative residual entropy of the randomized stopping time. The regularized problem takes the form of an (n+1)-dimensional degenerate singular stochastic control problem with finite fuel, where the regularized free boundary is the graph of a function mapping the state variable of the original stopping problem to the probability of stopping. We address this singular control problem through the dynamic programming principle, which enables us to identify the unique optimal exploratory strategy. Finally, we propose both model-based and model-free reinforcement learning algorithms tailored to exploratory optimal stopping problems, and we establish policy improvement guarantees for them. The model-free method is of actor-critic type and scales to high dimensions under neural network parameterization.
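As a rough illustration of the entropy penalty mentioned above, the following sketch evaluates a cumulative residual entropy on a discretized control path. It assumes the standard definition of cumulative residual entropy applied to the survival probability 1 - ξ_t, that is, the integral of -(1 - ξ_t) log(1 - ξ_t) over time; the abstract does not state the exact regularizer, so any weighting, discounting, or temperature parameter is omitted. The function name, the time grid, and the left-endpoint quadrature are illustrative choices, not the paper's.

```python
import math


def cumulative_residual_entropy(times, xi):
    """Left-endpoint Riemann sum of -(1 - xi_t) * log(1 - xi_t) dt.

    times: increasing time grid (hypothetical discretization).
    xi:    stopping probabilities on that grid, non-decreasing in [0, 1].
    Assumption: this is the textbook cumulative residual entropy of the
    survival probability 1 - xi_t, not necessarily the paper's exact penalty.
    """
    if len(times) != len(xi):
        raise ValueError("times and xi must have the same length")
    total = 0.0
    for k in range(len(times) - 1):
        survival = 1.0 - xi[k]
        # Convention 0 * log(0) = 0: skip intervals where stopping is certain.
        if survival > 0.0:
            total -= survival * math.log(survival) * (times[k + 1] - times[k])
    return total


# Example: a control that ramps linearly from 0 to 1 over [0, 1].
grid = [0.0, 0.5, 1.0]
ramp = [0.0, 0.5, 1.0]
penalty = cumulative_residual_entropy(grid, ramp)
```

Under this definition, a deterministic stopping rule (ξ jumping from 0 to 1) has zero entropy, while controls that keep the stopping probability strictly between 0 and 1 for longer have positive entropy, which is the sense in which such a term can encourage exploration.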