I study a stochastic multi-armed bandit problem in which rewards are subject to adversarial corruption. I propose a novel attack strategy that manipulates a learner employing the UCB algorithm into pulling some non-optimal target arm $T - o(T)$ times, with a cumulative cost that scales as $\widehat{O}(\sqrt{\log T})$, where $T$ is the number of rounds. I also prove the first lower bound on the cumulative attack cost. The lower bound matches the upper bound up to $O(\log \log T)$ factors, showing that the proposed attack strategy is near-optimal.