Reinforcement learning (RL) for constrained Markov decision processes (CMDPs) is increasingly important in a range of applications. Often, the average criterion is more suitable than the discounted criterion. Yet RL for average-criterion CMDPs (ACMDPs) remains challenging: algorithms designed for discounted constrained RL problems often perform poorly in the average-CMDP setting. In this paper, we introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by trust-region-based policy optimization algorithms. We develop basic sensitivity theory for average CMDPs and use the resulting bounds in the design of the algorithm. We provide theoretical guarantees on its performance and, through extensive experiments in various challenging OpenAI Gym environments, show its superior empirical performance compared to other state-of-the-art algorithms adapted to ACMDPs.