We study online learning in episodic constrained Markov decision processes (CMDPs), where the learner's goal is to collect as much reward as possible over the episodes while guaranteeing that some long-term constraints are satisfied during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical unconstrained MDPs has received considerable attention in recent years, the CMDP setting is still largely unexplored. This is surprising, since in real-world applications such as autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints. Our algorithm is capable of handling settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underlying process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds for settings in which constraints are selected stochastically, while it is the first to provide guarantees in the case in which they are chosen adversarially.