Learning Markov decision processes (MDPs) in an adversarial environment is a challenging problem. It becomes even more challenging with function approximation, since the underlying structure of the loss function and the transition kernel is especially hard to estimate in a varying environment. In fact, the state-of-the-art result for linear adversarial MDPs achieves a regret of $\tilde{O}(K^{6/7})$, where $K$ denotes the number of episodes, which leaves considerable room for improvement. In this paper, we investigate the problem from a new viewpoint, which reduces a linear MDP to a linear optimization problem by carefully setting the feature maps of the bandit arms in the linear optimization. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{O}(K^{4/5})$ for linear adversarial MDPs without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.