Existing online learning algorithms for adversarial Markov Decision Processes achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interaction even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This restriction exists because adversarial transition functions have been shown to make no-regret learning impossible. Despite such impossibility results, in this work we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + C^{\textsf{P}})$ regret, where $C^{\textsf{P}}$ measures how adversarial the transition functions are and is at most ${O}(T)$. While this algorithm itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we show that further refinements of the algorithm not only maintain the same regret bound, but also simultaneously adapt to easier environments (where losses are generated in a certain stochastically constrained manner as in Jin et al. [2021]) and achieve $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + C^{\textsf{P}})$ regret, where $U$ is a standard gap-dependent coefficient and $C^{\textsf{L}}$ is the amount of corruption on losses.
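To make the shape of the two bounds concrete, the following minimal sketch evaluates them numerically. It is an illustration only, not the paper's algorithm: all constants and logarithmic factors hidden by $\widetilde{{O}}(\cdot)$ are assumed to be 1, and the function names `worst_case_bound` and `stochastic_bound` are hypothetical.

```python
import math


def worst_case_bound(T: int, C_P: float) -> float:
    """Evaluate sqrt(T) + C^P.

    Assumption: constants and log factors hidden by the O-tilde
    notation are set to 1, so only the growth shape is meaningful.
    """
    if T < 0 or C_P < 0:
        raise ValueError("T and C_P must be non-negative")
    return math.sqrt(T) + C_P


def stochastic_bound(U: float, C_L: float, C_P: float) -> float:
    """Evaluate U + sqrt(U * C^L) + C^P under the same assumption."""
    if U < 0 or C_L < 0 or C_P < 0:
        raise ValueError("U, C_L and C_P must be non-negative")
    return U + math.sqrt(U * C_L) + C_P


if __name__ == "__main__":
    T = 10_000
    # C_P = 0: fixed transitions; C_P = T: fully adversarial transitions.
    for C_P in (0, 100, 10_000):
        print(f"T={T}, C_P={C_P}: bound shape = {worst_case_bound(T, C_P)}")
```

Under these assumptions, the first expression stays of order $\sqrt{T}$ as long as $C^{\textsf{P}}$ is at most of order $\sqrt{T}$, and becomes linear in $T$ only when $C^{\textsf{P}}$ itself is of order $T$, which is consistent with the impossibility result for fully adversarial transitions.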