We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints. We consider two scenarios. In the first, we address general CMDPs and design an algorithm that attains sublinear regret and sublinear cumulative positive constraint violation. In the second, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Indeed, previous works either focus on much weaker soft constraints, which allow positive violations to cancel out negative ones, or are restricted to stochastic losses. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with state-of-the-art algorithms. This enables their adoption in a much wider range of real-world applications, from autonomous driving to online advertising and recommender systems.
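The distinction between soft and hard constraints can be made concrete with a small numerical sketch. It assumes a single scalar constraint cost observed per episode and a fixed threshold; the names `costs`, `threshold`, `soft_violation`, and `positive_violation` are hypothetical and not taken from the paper, whose formal definitions may differ in detail (for example, by using expected costs under the played policy).

```python
def soft_violation(costs, threshold):
    """Cumulative signed violation: episodes below the threshold
    offset episodes above it (the 'soft' notion)."""
    return sum(c - threshold for c in costs)


def positive_violation(costs, threshold):
    """Cumulative positive violation: only the excess over the
    threshold is counted, so nothing can cancel it out."""
    return sum(max(c - threshold, 0.0) for c in costs)


# Hypothetical per-episode constraint costs that alternate between
# violating and over-satisfying a threshold of 0.5.
costs = [0.75, 0.25, 0.75, 0.25]
threshold = 0.5

soft = soft_violation(costs, threshold)      # 0.0: violations cancel out
hard = positive_violation(costs, threshold)  # 0.5: violations accumulate
print(soft, hard)
```

In this toy sequence the learner violates the constraint in half of the episodes, yet the soft measure reports zero violation. The positive measure grows with every violating episode, which is why keeping it sublinear is the stricter requirement.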