Sequential learning with feedback graphs is a natural extension of the multi-armed bandit problem in which an underlying graph structure provides additional information: playing an action reveals the losses of all of its neighbors. This problem was introduced by \citet{mannor2011} and has received considerable attention in recent years. The literature commonly states that the minimax regret rate for this problem is of order $\sqrt{\alpha T}$, where $\alpha$ is the independence number of the graph and $T$ is the time horizon. However, this rate is proven only when the number of rounds $T$ is larger than $\alpha^3$, which significantly restricts the usefulness of the result for large graphs. In this paper, we define a new quantity $R^*$, called the \emph{problem complexity}, and prove that the minimax regret is proportional to $R^*$ for any graph and any time horizon $T$. By introducing an intricate exploration strategy, we define the \mainAlgorithm algorithm, which achieves the minimax optimal regret bound and is the first provably optimal algorithm for this setting, even when $T$ is smaller than $\alpha^3$.
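
For concreteness, the feedback protocol and the regret described above can be written out as follows. This is only a sketch in standard notation that we assume here and that does not appear in the text above: $G = (V, E)$ denotes the feedback graph, $N(i) \subseteq V$ the set of neighbors of action $i$ (assumed, for illustration, to contain $i$ itself), $\ell_t(i) \in [0,1]$ the loss of action $i$ in round $t$, and $I_t$ the action played in round $t$. In each round $t = 1, \dots, T$, the learner plays $I_t \in V$, suffers $\ell_t(I_t)$, and observes $\{\ell_t(j) : j \in N(I_t)\}$. Under these assumptions, the regret and the commonly stated rate read
\begin{equation*}
  R_T \;=\; \max_{i \in V} \, \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(I_t) - \sum_{t=1}^{T} \ell_t(i) \right],
  \qquad
  R_T \text{ of order } \sqrt{\alpha T} \quad \text{for } T > \alpha^3 ,
\end{equation*}
where $\alpha$ is the size of the largest set of pairwise non-adjacent vertices of $G$. The regime $T \le \alpha^3$ is the one left open by this statement.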