The infinite horizon setting is widely adopted for reinforcement learning (RL) problems, and in this setting the optimal policies are stationary. In many situations, however, finite horizon control problems are of interest, and for such problems the optimal policies are time-varying in general. Another setting that has recently become popular is constrained reinforcement learning, in which the agent maximizes its rewards while also aiming to satisfy given constraint criteria. However, this setting has only been studied in the context of infinite horizon Markov decision processes (MDPs), where stationary policies are optimal. We present an algorithm for constrained RL in the finite horizon setting, where the horizon terminates after a fixed (finite) time. Our algorithm uses function approximation, which is essential when the state and action spaces are large or continuous, and uses the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also analyze the performance of our algorithm through experiments and show that it performs better than some other well-known algorithms.
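To make the setting concrete, the following is a minimal illustrative sketch of a stage-dependent (non-stationary) policy trained by policy gradient under a constraint. It is not the algorithm of this paper. It rests on several assumptions that the abstract does not state: the constraint has the form "expected total cost at most `alpha`", it is handled by a Lagrangian relaxation with multiplier `lam`, the policy is a tabular softmax rather than a function approximator, the gradient is computed exactly from a known model rather than from samples, and there is no terminal reward. All names (`train`, `policy_gradient`, `P`, `R`, `C`, `mu0`, `alpha`) are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    z = [math.exp(x - m) for x in row]
    total = sum(z)
    return [x / total for x in z]

def policy_from(theta):
    # theta[h][s] holds action preferences; one softmax per (stage, state),
    # so the policy is allowed to differ from stage to stage.
    return [[softmax(row) for row in stage] for stage in theta]

def evaluate(P, R, pi, mu0):
    """Backward induction for a stage-dependent policy.
    P[s][a][t]: transition probabilities, R[h][s][a]: per-stage signal.
    Returns Q[h][s][a], V[h][s] and the expected total from mu0."""
    H, S, A = len(R), len(P), len(P[0])
    V_next = [0.0] * S  # assumption: no terminal reward
    Q, V = [None] * H, [None] * H
    for h in reversed(range(H)):
        Q[h] = [[R[h][s][a] + sum(P[s][a][t] * V_next[t] for t in range(S))
                 for a in range(A)] for s in range(S)]
        V[h] = [sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
                for s in range(S)]
        V_next = V[h]
    return Q, V, sum(mu0[s] * V[0][s] for s in range(S))

def occupancy(P, pi, mu0):
    # d[h][s]: probability of being in state s at stage h under pi.
    H, S, A = len(pi), len(P), len(P[0])
    d = [list(mu0)]
    for h in range(H - 1):
        d.append([sum(d[h][s] * pi[h][s][a] * P[s][a][t]
                      for s in range(S) for a in range(A))
                  for t in range(S)])
    return d

def policy_gradient(P, R, pi, mu0):
    # Exact gradient for a tabular softmax policy:
    # dJ/dtheta[h][s][a] = d[h][s] * pi[h][s][a] * (Q[h][s][a] - V[h][s]).
    Q, V, J = evaluate(P, R, pi, mu0)
    d = occupancy(P, pi, mu0)
    H, S, A = len(R), len(P), len(P[0])
    grad = [[[d[h][s] * pi[h][s][a] * (Q[h][s][a] - V[h][s])
              for a in range(A)] for s in range(S)] for h in range(H)]
    return grad, J

def train(P, R, C, mu0, alpha, iters=500, lr_theta=0.5, lr_lam=0.05):
    H, S, A = len(R), len(P), len(P[0])
    theta = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    lam = 0.0
    for _ in range(iters):
        pi = policy_from(theta)
        gR, _ = policy_gradient(P, R, pi, mu0)   # gradient of expected reward
        gC, G = policy_gradient(P, C, pi, mu0)   # gradient and value of expected cost
        # Ascent on the Lagrangian J - lam * (G - alpha) in theta.
        for h in range(H):
            for s in range(S):
                for a in range(A):
                    theta[h][s][a] += lr_theta * (gR[h][s][a] - lam * gC[h][s][a])
        # Projected update keeps the multiplier non-negative.
        lam = max(0.0, lam + lr_lam * (G - alpha))
    return policy_from(theta), lam
```

The returned policy is indexed as `pi[h][s][a]`, so each stage `h` has its own action distribution, which is the sense in which a finite horizon policy is non-stationary. Step sizes and iteration counts here are arbitrary choices for the sketch, and nothing in it carries the convergence guarantee claimed for the paper's algorithm.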