We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for any time-space recursive (TSR) cost criterion. A TSR criterion requires the cost of a policy to be computable recursively over both time and (state) space, a class that includes classical expectation, almost-sure, and anytime constraints. Our work answers three open questions spanning two long-standing lines of research: polynomial-time approximability is possible for (1) anytime-constrained policies, (2) almost-sure-constrained policies, and (3) deterministic expectation-constrained policies.