We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Under mild reward assumptions, our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for a diverse class of cost criteria. This class requires that the cost of a policy be computable recursively over both time and (state) space, and it includes classical expectation, almost-sure, and anytime constraints. Our work not only provides provably efficient algorithms for real-world decision-making challenges but also offers a unifying theory for the efficient computation of constrained deterministic policies.
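To make the first idea concrete, below is a minimal, hypothetical sketch of value-demand augmentation with demand rounding in Python. It assumes a finite-horizon tabular MDP with deterministic transitions and nonnegative rewards; the function name min_cost_for_demand, the inputs T, r, c, and the exact rounding scheme are illustrative assumptions, not the paper's full construction (which additionally uses action-space approximate dynamic programming and time-space rounding to handle stochastic transitions and the other cost criteria).

```python
import math

def min_cost_for_demand(n_states, n_actions, T, r, c, H, eps):
    """Minimal sketch of value-demand augmentation with rounding.

    Hypothetical helper, not the paper's full algorithm: transitions
    are assumed deterministic (T[s][a] is a single successor state),
    rewards r[s][a] are nonnegative, and the horizon H is finite.
    Value demands live on the grid {0, eps, 2*eps, ...}, indexed by
    the integer k.

    Returns a table C with C[t][s][k] = approximate minimum cumulative
    cost to collect total reward >= k*eps from state s at step t.
    """
    INF = float("inf")
    max_r = max(max(row) for row in r)
    K = math.ceil(H * max_r / eps)  # largest useful demand index

    # Base case: at the horizon, only the zero demand is satisfiable.
    C = [[[INF] * (K + 1) for _ in range(n_states)] for _ in range(H + 1)]
    for s in range(n_states):
        C[H][s][0] = 0.0

    # Backward induction over the augmented (state, demand) space.
    for t in range(H - 1, -1, -1):
        for s in range(n_states):
            for k in range(K + 1):
                best = INF
                for a in range(n_actions):
                    # Round the reward *down* to the grid, so the
                    # residual demand handed to the successor is never
                    # undercounted: any finite-cost table entry
                    # genuinely meets its stated demand.
                    k_next = max(k - math.floor(r[s][a] / eps), 0)
                    best = min(best, c[s][a] + C[t + 1][T[s][a]][k_next])
                C[t][s][k] = best
    return C
```

Under these assumptions the augmented table has O(n_states * H * K) entries with K = O(H * max_r / eps), so its size is polynomial in the instance size and in 1/eps, matching the FPTAS flavor claimed above; the rounding granularity eps directly trades table size against approximation quality.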