Reinforcement Learning (RL) is a versatile framework for sequential decision-making, with applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. In these applications, the primary objective is to maximize the long-run average reward, yet real-world scenarios often require adherence to specific constraints during learning. This monograph explores model-based and model-free approaches to constrained RL in the setting of average-reward Markov Decision Processes (MDPs). The discussion begins with model-based strategies, examining two foundational methods: optimism in the face of uncertainty and posterior sampling. It then turns to parametrized model-free approaches, where a primal-dual policy-gradient algorithm is developed as a solution for constrained MDPs. For each setup, the monograph provides regret guarantees and analyzes constraint violation, assuming throughout that the underlying MDP is ergodic. The discussion is then extended to results tailored to weakly communicating MDPs, broadening the scope of the findings and their relevance to a wider range of practical scenarios.
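For concreteness, the constrained average-reward problem described above admits the following standard formalization; the notation (reward $r$, cost $c$, constraint threshold $b$) is illustrative, since the abstract itself fixes no symbols, and the monograph's exact definitions may differ. A learner seeks a policy $\pi$ solving

\begin{align*}
  \max_{\pi} \quad & \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] \\
  \text{subject to} \quad & \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} c(s_t, a_t)\right] \ge b,
\end{align*}

and, writing $J^{*}$ for the optimal constrained average reward, performance over $T$ steps is typically measured by the regret and constraint violation

\begin{align*}
  \mathrm{Reg}(T) = T J^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right],
  \qquad
  \mathrm{Viol}(T) = \left( T b - \mathbb{E}\!\left[\sum_{t=1}^{T} c(s_t, a_t)\right] \right)_{+},
\end{align*}

where $(x)_{+} = \max(x, 0)$, so that only shortfalls below the threshold are penalized.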