Learning with expert advice and the multi-armed bandit are two classic online decision problems that differ in how information is observed in each round of the game. We study a family of problems interpolating between the two. For a vector $\mathbf{m}=(m_1,\dots,m_K)\in \mathbb{N}^K$, an instance of $\mathbf{m}$-MAB is one in which the arms are partitioned into $K$ groups and the $k$-th group contains $m_k$ arms. Once an arm is pulled, the losses of all arms in the same group are observed. We prove tight minimax regret bounds for $\mathbf{m}$-MAB and design an optimal PAC algorithm for its pure exploration version, $\mathbf{m}$-BAI, where the goal is to identify the arm with minimum loss in as few rounds as possible. We show that the minimax regret of $\mathbf{m}$-MAB is $\Theta\left(\sqrt{T\sum_{k=1}^K\log (m_k+1)}\right)$ and that the minimum number of pulls for an $(\epsilon,0.05)$-PAC algorithm for $\mathbf{m}$-BAI is $\Theta\left(\frac{1}{\epsilon^2}\cdot \sum_{k=1}^K\log (m_k+1)\right)$. Both our upper and lower bounds for $\mathbf{m}$-MAB extend to a more general setting, namely the bandit with graph feedback, in terms of the clique cover and related graph parameters. As a consequence, we obtain tight minimax regret bounds for several families of feedback graphs.
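As a sanity check on the regret bound (a worked illustration under our own notation, not a statement from the paper), consider the two extreme choices of $\mathbf{m}$, where $N$ is a hypothetical total number of arms introduced here for the example:

$$
\mathbf{m}=(N):\ \sqrt{T\sum_{k=1}^{1}\log (m_k+1)}=\sqrt{T\log (N+1)},
\qquad
\mathbf{m}=(\underbrace{1,\dots,1}_{N}):\ \sqrt{T\sum_{k=1}^{N}\log 2}=\sqrt{TN\log 2}.
$$

With a single group, every pull reveals all losses, and for $N\ge 2$ the expression agrees up to constants with the familiar $\Theta(\sqrt{T\log N})$ rate for learning with expert advice. With $N$ singleton groups, only the pulled arm is observed, and the expression agrees up to constants with the $\Theta(\sqrt{TN})$ rate for the multi-armed bandit.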