We consider a stochastic multi-armed bandit problem in which rewards are subject to adversarial corruption. We propose a novel attack strategy that manipulates a learner following the UCB principle into pulling some non-optimal target arm $T - o(T)$ times, at a cumulative cost that scales as $\sqrt{\log T}$, where $T$ is the number of rounds. We also prove the first lower bound on the cumulative attack cost. This lower bound matches our upper bound up to $\log \log T$ factors, showing that our attack is near-optimal.