This paper studies reinforcement learning (RL) in infinite-horizon dynamic decision processes with almost-sure safety constraints. Such safety-constrained decision processes are central to applications in autonomous systems, finance, and resource management, where policies must satisfy strict, state-dependent constraints. We consider a doubly-regularized RL framework that combines reward and parameter regularization to address these constraints within continuous state-action spaces. Specifically, we formulate the problem as a convex regularized objective with parametrized policies in the mean-field regime. Our approach leverages recent developments in mean-field theory and Wasserstein gradient flows to model policies as elements of an infinite-dimensional statistical manifold, so that policy updates evolve as gradient flows on the space of parameter distributions. Our main contributions are threefold: we establish solvability conditions for the safety-constrained problem, define smooth and bounded approximations that enable the gradient-flow analysis, and prove exponential convergence towards the global solution under sufficient regularization. We provide general conditions on the regularization functions, encompassing standard entropy regularization as a special case. The results also support a particle-method implementation for practical RL applications. The theoretical insights and convergence guarantees presented here offer a robust framework for safe RL in complex, high-dimensional decision-making problems.
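To fix ideas, one schematic instantiation of such a doubly-regularized objective (the notation, the Kullback-Leibler penalty, and the reference measure $\rho$ below are illustrative assumptions on our part, not the paper's exact formulation) reads

\[
\sup_{m \in \mathcal{P}(\Theta)} J^{\sigma,\tau}(m)
\;=\;
\mathbb{E}^{\pi_m}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
\bigl( r(s_t, a_t) - \tau \log \pi_m(a_t \mid s_t) \bigr)\right]
\;-\; \sigma\,\mathrm{KL}(m \,\|\, \rho),
\]

where $m$ is a distribution over policy parameters $\theta \in \Theta$, $\pi_m$ is the induced policy, the $\tau$-term is the reward-side regularization (here: entropy, which the paper treats as a special case of its general regularization functions), the $\sigma$-term is the parameter-side regularization towards a reference measure $\rho$, and admissible policies are further restricted to those keeping the state in the safe set almost surely.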
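As a rough sketch of how the particle-method implementation could look in practice, the toy example below is entirely our own construction and not the paper's algorithm: a single-state problem in which each particle is a candidate action parameter, the hard safety constraint $|a| \le 1$ is replaced by a smooth, bounded soft barrier (in the spirit of the smooth approximations above), and entropy regularization enters as Gaussian noise. The particles then perform noisy gradient ascent (mean-field Langevin dynamics), the standard particle discretization of a Wasserstein gradient flow; none of the function names or constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions, not the paper's model):
# each particle theta is a candidate action parameter, the reward is a
# smooth concave function, and the almost-sure constraint |a| <= 1 is
# replaced by a smooth, bounded soft barrier.

def grad_reward(theta):
    # Gradient of the reward r(a) = -(a - 0.8)^2, maximized at a = 0.8.
    return -2.0 * (theta - 0.8)

def soft_barrier_grad(theta, beta=20.0):
    # Gradient of a softplus-type barrier penalizing |theta| > 1;
    # a smooth, bounded stand-in for the hard safety constraint.
    return (1.0 / (1.0 + np.exp(-beta * (theta - 1.0)))
            - 1.0 / (1.0 + np.exp(-beta * (-theta - 1.0))))

def step(particles, eta=1e-2, sigma=0.05, lam=10.0):
    # One Euler step of mean-field Langevin dynamics: each particle
    # follows the gradient of (reward - lam * barrier), plus
    # sqrt(2 * sigma * eta) Gaussian noise. The noise term is the
    # particle-level counterpart of entropy regularization with
    # strength sigma in the Wasserstein gradient flow.
    drift = grad_reward(particles) - lam * soft_barrier_grad(particles)
    noise = np.sqrt(2.0 * sigma * eta) * rng.standard_normal(particles.shape)
    return particles + eta * drift + noise

particles = rng.normal(0.0, 2.0, size=1000)   # initial parameter sample
for _ in range(2000):
    particles = step(particles)

print(f"mean action {particles.mean():.3f}, "
      f"fraction safe (|a| <= 1): {(np.abs(particles) <= 1.0).mean():.3f}")
```

In the mean-field limit of infinitely many particles, the empirical measure of the particles approximates the gradient flow on the space of parameter distributions, and an exponential convergence guarantee of the kind stated in the abstract would then govern how fast that flow approaches the regularized optimum.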