Reinforcement Learning (RL) with constraints is becoming increasingly important for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet RL for average-criterion constrained MDPs (CMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well in the average CMDP setting. In this paper, we introduce a new (possibly the first) policy optimization algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by the famed PPO-type algorithms based on trust region methods. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance and, through extensive experimental work in various challenging MuJoCo environments, show the superior performance of the algorithm when compared to other state-of-the-art algorithms adapted for the average CMDP setting.