In this paper, we consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each consisting of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3SAT})$ as far as the dependencies on $S$, $A$, and $T$ are concerned. The proposed algorithm and analysis completely dispense with the typical tool of occupancy measures; instead, the algorithm performs policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over estimated advantage functions, making it easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact of transition kernel estimation on the value functions (Zhang et al., 2023).
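To make the policy-optimization ingredient concrete, the following is a minimal sketch (not the authors' exact APO-MVP procedure) of the kind of per-episode update the abstract describes: a black-box online linear optimization strategy, here instantiated with exponential weights as an illustrative choice, run per stage and state over estimated advantage functions. The shapes, the learning rate `eta`, and the advantage estimates `adv_hat` are assumptions made for illustration only.

```python
import numpy as np

def exp_weights_policy_update(policy, adv_hat, eta):
    """One online-linear-optimization step per (stage, state) pair.

    policy  : array of shape (H, S, A), current stochastic policy
    adv_hat : array of shape (H, S, A), estimated advantage functions
    eta     : learning rate of the exponential-weights strategy (assumed)
    """
    logits = np.log(policy + 1e-12) + eta * adv_hat   # multiplicative-weights step
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stabilization
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=-1, keepdims=True)  # renormalize over actions
    return new_policy

# Hypothetical usage: H stages, S states, A actions.
H, S, A = 5, 10, 4
policy = np.full((H, S, A), 1.0 / A)       # start from the uniform policy
adv_hat = np.random.randn(H, S, A)         # stand-in for estimated advantages
policy = exp_weights_policy_update(policy, adv_hat, eta=0.1)
```

Because the update only requires the estimated advantages (obtained via dynamic programming) and a no-regret online linear optimizer, no occupancy-measure computation is involved, which is what keeps the method simple to implement.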