We propose and analyze a continuous-time robust reinforcement learning framework for optimal stopping under ambiguity. In this framework, an agent chooses a robust exploratory stopping time motivated by two objectives: robust decision-making under ambiguity and learning about the unknown environment. Here, ambiguity means that the agent considers multiple probability measures dominated by a reference measure, reflecting her awareness that the reference measure, which represents her learned belief about the environment, may be erroneous. Using the $g$-expectation framework, we reformulate the optimal stopping problem under ambiguity as a robust exploratory control problem with Bernoulli-distributed controls. We then characterize the optimal Bernoulli-distributed control via backward stochastic differential equations and, based on this characterization, construct the robust exploratory stopping time that approximates the optimal stopping time under ambiguity. Finally, we establish a policy iteration theorem and implement it as a reinforcement learning algorithm. Numerical experiments demonstrate the convergence, robustness, and scalability of our reinforcement learning algorithm across different levels of ambiguity and exploration.