We derive near-optimal per-action regret bounds for sleeping bandits, in which both the set of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly by minimizing internal sleeping regret. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\sqrt{\ln{K}}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depend primarily on the sum of the experts' confidences. Finally, we prove a lower bound showing that, for any minimax-optimal algorithm, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.
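For concreteness, the interaction protocol and the per-action regret can be sketched as below. This is only an illustration under stated assumptions, not the generalized EXP3 analyzed in the paper: it is plain EXP3 with weights renormalized over the awake arms in each round. The function name `sleeping_exp3`, the array layout, and the learning rate `eta` are hypothetical choices made for this example. The per-action regret of an arm is taken to be the learner's loss minus that arm's loss, summed over the rounds in which the arm is available.

```python
import numpy as np

def sleeping_exp3(losses, available, eta, seed=0):
    """Illustrative baseline: EXP3 restricted to the awake arms of each round.

    losses:    (T, K) array with entries in [0, 1], chosen by the adversary
    available: (T, K) boolean array, at least one True per row
    eta:       learning rate (a free parameter in this sketch)
    Returns the realized per-action regret of each arm.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_est = np.zeros(K)   # cumulative importance-weighted loss estimates
    regret = np.zeros(K)    # accrues only on rounds where the arm is awake
    for t in range(T):
        awake = np.flatnonzero(available[t])
        # Exponential weights over awake arms (shifted for numerical stability).
        w = np.exp(-eta * (cum_est[awake] - cum_est[awake].min()))
        p = w / w.sum()
        j = rng.choice(len(awake), p=p)
        arm = awake[j]
        loss = losses[t, arm]
        # Bandit feedback: only the pulled arm's estimate is updated.
        cum_est[arm] += loss / p[j]
        # Per-action regret: learner's loss minus each awake arm's loss.
        regret[awake] += loss - losses[t, awake]
    return regret
```

An arm that is never available accrues zero regret by construction, and each arm's regret is bounded in absolute value by its number of active rounds when losses lie in $[0,1]$.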