Reinforcement Learning (RL) with constraints is becoming an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-criterion constrained MDPs (CMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well in the average CMDP setting. In this paper, we introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by the famed PPO-type algorithms based on trust region methods. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance and, through extensive experimental work in various challenging MuJoCo environments, show that it outperforms other state-of-the-art algorithms adapted for the average CMDP setting.
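For readers unfamiliar with the distinction, the two criteria contrasted above can be written out as follows. This is a sketch of the standard textbook formulation, not necessarily the notation used in the paper: the symbols $r$ (reward), $c_i$ (cost functions), $l_i$ (constraint thresholds), $m$ (number of constraints), and $\gamma$ (discount factor) are assumptions introduced here for illustration.

```latex
% Discounted criterion: future rewards are geometrically down-weighted.
J_\gamma(\pi) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right],
\qquad 0 \le \gamma < 1 .

% Average criterion: long-run reward per time step, with no discounting.
J(\pi) = \lim_{N \to \infty} \frac{1}{N}\,
         \mathbb{E}_\pi\!\left[ \sum_{t=0}^{N-1} r(s_t, a_t) \right].

% Average-criterion CMDP: maximize the average reward subject to
% bounds on the average of each cost signal.
\max_{\pi} \; J(\pi)
\quad \text{s.t.} \quad
J_{c_i}(\pi) = \lim_{N \to \infty} \frac{1}{N}\,
               \mathbb{E}_\pi\!\left[ \sum_{t=0}^{N-1} c_i(s_t, a_t) \right] \le l_i,
\qquad i = 1, \dots, m .
```

Under the average criterion, rewards and costs incurred late in a trajectory count as much as early ones, which is one reason it may be the more natural objective for continuing tasks.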