Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, with applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. In these applications, the primary objective is to maximize the average reward, yet real-world scenarios often require adherence to specific constraints during the learning process. This monograph explores model-based and model-free approaches to Constrained RL in the setting of average-reward Markov Decision Processes (MDPs). We begin with model-based strategies, examining two foundational methods: optimism in the face of uncertainty and posterior sampling. We then turn to parametrized model-free approaches, where a primal-dual policy-gradient algorithm is analyzed as a solution for constrained MDPs. For each setup, the monograph provides regret guarantees and analyzes constraint violation. Throughout this analysis, we assume the underlying MDP is ergodic. Finally, the monograph extends its discussion to results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.
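To make the primal-dual idea mentioned above concrete, the following is a minimal sketch of a Lagrangian-based primal-dual update: maximize a reward objective J_r(θ) subject to a constraint J_c(θ) ≥ b by ascending the Lagrangian in the policy parameter θ and descending in the multiplier λ. The objectives here are toy quadratic surrogates standing in for policy-gradient estimates; the function names, step sizes, and threshold b are illustrative assumptions, not the monograph's algorithm.

```python
# Hedged sketch of a primal-dual method for a constrained problem:
#   maximize J_r(theta)  subject to  J_c(theta) >= b,
# via the Lagrangian L(theta, lam) = J_r(theta) + lam * (J_c(theta) - b).
# In constrained RL, grad_Jr / grad_Jc would be policy-gradient estimates of
# the average reward and average constraint-cost; here they are toy
# concave quadratics chosen only so the dynamics are easy to follow.

def grad_Jr(theta):
    # gradient of J_r(theta) = -(theta - 1)^2, maximized at theta = 1
    return -2.0 * (theta - 1.0)

def Jc(theta):
    # constraint objective J_c(theta) = 1 - (theta + 0.5)^2
    return 1.0 - (theta + 0.5) ** 2

def grad_Jc(theta):
    return -2.0 * (theta + 0.5)

b = 0.5                       # require J_c(theta) >= b
theta, lam = 0.0, 0.0         # primal variable and dual multiplier
eta_theta, eta_lam = 0.05, 0.05

for _ in range(2000):
    # primal step: gradient ascent on the Lagrangian in theta
    theta += eta_theta * (grad_Jr(theta) + lam * grad_Jc(theta))
    # dual step: increase lam when the constraint is violated,
    # projecting back onto lam >= 0
    lam = max(0.0, lam - eta_lam * (Jc(theta) - b))

print(theta, lam, Jc(theta))
```

The dual multiplier λ grows while the constraint is violated, pulling θ away from the unconstrained optimum until J_c(θ) sits near the threshold b; this tension between the primal ascent and dual descent is the mechanism whose regret and constraint-violation guarantees the monograph analyzes.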