We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works on Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversary is adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies in hindsight. We show that if the opponent has unbounded memory or is non-stationary, then sample-efficient learning is not possible. Even for memory-bounded and stationary opponents, we show that learning remains statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.
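For concreteness, here is a minimal sketch of the policy-regret benchmark, under the assumption that the learner plays policies $\pi_1, \dots, \pi_T$ from a feasible set $\Pi$ and that the per-round return $V_t(\pi_1, \dots, \pi_t)$ depends on the entire history of play through the adversary's adaptation; the symbols $\Pi$ and $V_t$ are illustrative notation rather than the paper's own, and the exact definition in the body may differ:
\[
\mathrm{PolReg}(T) \;=\; \max_{(\pi'_1, \dots, \pi'_T) \in \Pi^T} \sum_{t=1}^{T} V_t(\pi'_1, \dots, \pi'_t) \;-\; \sum_{t=1}^{T} V_t(\pi_1, \dots, \pi_t).
\]
The first sum is the counterfactual return the learner would have earned by committing to the fixed sequence $(\pi'_1, \dots, \pi'_T)$ from round one, with the adversary adapting to \emph{that} history; external regret would instead hold the adversary's realized responses fixed while swapping in a comparator policy, which is why it understates the cost of playing against an adaptive opponent.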