Existing online learning algorithms for adversarial Markov Decision Processes achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interaction even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This restriction exists because adversarial transition functions have been shown to make no-regret learning impossible. Despite such impossibility results, in this work we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + C^{\textsf{P}})$ regret, where $C^{\textsf{P}}$ measures how adversarial the transition functions are and can be at most ${O}(T)$. While this algorithm itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we show that further refinements of the algorithm not only maintain the same regret bound, but also simultaneously adapt to easier environments (where losses are generated in a certain stochastically constrained manner, as in Jin et al. [2021]) and achieve $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient and $C^{\textsf{L}}$ is the amount of corruption on losses.
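To make the "smooth" dependence on the adversary concrete, the following is a minimal sketch that only evaluates the shape of the two stated bounds. It ignores constants, logarithmic factors, and any dependence on other problem parameters (all hidden by the $\widetilde{{O}}$ notation), and the function names are hypothetical, chosen here for illustration rather than taken from the paper.

```python
import math


def worst_case_bound(T, c_p):
    """Shape of the sqrt(T) + C^P bound (constants and log factors dropped)."""
    return math.sqrt(T) + c_p


def gap_dependent_bound(u, c_l, c_p):
    """Shape of the U + sqrt(U * C^L) + C^P bound (constants and log factors dropped)."""
    return u + math.sqrt(u * c_l) + c_p


if __name__ == "__main__":
    T = 10_000  # illustrative horizon, an assumption of this sketch
    # Three regimes of transition corruption: none, order sqrt(T), and order T.
    for c_p in (0.0, math.sqrt(T), float(T)):
        bound = worst_case_bound(T, c_p)
        print(f"C^P = {c_p:>8.0f}  ->  bound shape = {bound:>8.0f}  (bound / T = {bound / T:.2f})")
```

With $C^{\textsf{P}} = 0$ the expression reduces to the $\sqrt{T}$ rate of the fixed-transition setting, it stays of order $\sqrt{T}$ as long as $C^{\textsf{P}}$ is of order $\sqrt{T}$, and it becomes linear in $T$ only when $C^{\textsf{P}}$ is itself of order $T$, which is consistent with the impossibility result for fully adversarial transitions.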