We study online learning problems in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints. We consider two different scenarios. In the first one, we address general CMDPs, where we design an algorithm that attains sublinear regret and sublinear cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Indeed, previous works either focus on much weaker soft constraints, which allow positive violations to cancel out negative ones, or are restricted to stochastic losses. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with state-of-the-art algorithms. This enables their adoption in a much wider range of real-world applications, ranging from autonomous driving to online advertising and recommender systems.