We study a novel setting in Online Markov Decision Processes (OMDPs) in which the loss function is chosen by a non-oblivious strategic adversary who follows a no-external-regret algorithm. In this setting, we first show that MDP-Expert, an existing algorithm that works well against oblivious adversaries, still applies and achieves a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)}+\tau^2\sqrt{ T \log(|A|)})$, where $L$ is the size of the adversary's pure strategy set and $|A|$ is the size of the agent's action space. Motivated by real-world games in which the support size of a Nash equilibrium (NE) is small, we further propose a new algorithm, MDP-Online Oracle Expert (MDP-OOE), which achieves a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{ T k \log(k)})$, where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of the Double Oracle method from game theory and can therefore solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, we introduce, under the same setting of a no-external-regret adversary in OMDPs, an algorithm that achieves last-round convergence to an NE. To the best of our knowledge, this is the first work to obtain a last-iterate convergence result in OMDPs.
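For readers unfamiliar with the term, a no-external-regret algorithm is one whose cumulative loss exceeds that of the best fixed action in hindsight by only a sublinear amount in $T$. The sketch below uses Hedge (multiplicative weights) as one standard example of such an algorithm. It is an illustration only: the abstract does not state which no-external-regret algorithm the adversary runs, and the function names, the random loss sequence, and the learning rate are assumptions made for this example.

```python
import math
import random


def hedge(loss_rounds, n_actions, eta):
    """Run Hedge on a sequence of loss vectors with entries in [0, 1].

    Returns the learner's cumulative expected loss and its final mixed strategy.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    weights = [1.0] * n_actions
    total_loss = 0.0
    for losses in loss_rounds:
        z = sum(weights)
        probs = [w / z for w in weights]
        # Expected loss of the current mixed strategy on this round.
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # Exponentially down-weight actions that incurred high loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(weights)
    return total_loss, [w / z for w in weights]


def external_regret(loss_rounds, n_actions, learner_loss):
    """Learner's loss minus the loss of the best fixed action in hindsight."""
    best_fixed = min(sum(r[a] for r in loss_rounds) for a in range(n_actions))
    return learner_loss - best_fixed


random.seed(0)
T, n = 2000, 5
# Assumed toy loss sequence: i.i.d. uniform losses in [0, 1].
rounds = [[random.random() for _ in range(n)] for _ in range(T)]
eta = math.sqrt(8 * math.log(n) / T)  # standard tuning for losses in [0, 1]
loss, final_strategy = hedge(rounds, n, eta)
regret = external_regret(rounds, n, loss)
bound = math.sqrt(T * math.log(n) / 2)  # classical Hedge regret guarantee
print(f"regret = {regret:.2f}, bound = {bound:.2f}")
```

With this tuning, the regret of Hedge is at most $\sqrt{T \ln(n)/2}$ for any loss sequence in $[0,1]$, which has the same $\sqrt{T \log(\cdot)}$ form as the terms in the bounds above.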