I study adversarial attacks against stochastic bandit algorithms. In each round, the learner chooses an arm and a stochastic reward is generated. The adversary strategically adds corruption to the reward, and the learner observes only the corrupted reward. This paper presents two sets of results. The first concerns optimal attack strategies for the adversary. The adversary has a target arm that he wishes to promote, and his goal is to manipulate the learner into choosing this target arm $T - o(T)$ times, where $T$ is the number of rounds. I design attack strategies against UCB and Thompson sampling that spend only $\widehat{O}(\sqrt{\log T})$ cost. I also present matching lower bounds and exactly characterize the vulnerability of UCB, Thompson sampling, and $\varepsilon$-greedy. The second concerns how the learner can defend against the adversary. Inspired by the literature on smoothed analysis and behavioral economics, I present two simple algorithms that achieve a competitive ratio arbitrarily close to 1.
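The interaction protocol described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the attack strategy analyzed in the paper: the function name `run_attacked_ucb`, the `margin` parameter, the unit-variance Gaussian rewards, and the naive oracle adversary (one that knows the true means and shifts every non-target reward by a fixed amount) are all hypothetical choices made for the example. The attack cost is taken to be the sum of the absolute corruptions.

```python
import math
import random


def run_attacked_ucb(means, target, horizon, margin=0.5, seed=0):
    """Simulate UCB1 on Gaussian arms while an adversary corrupts rewards.

    Returns (pull counts per arm, total attack cost).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k  # sums of corrupted rewards: all the learner ever sees
    cost = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # UCB1 index: empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )

        reward = rng.gauss(means[arm], 1.0)  # true stochastic reward

        # Hypothetical naive attack (an assumption, not the paper's method):
        # shift each non-target arm so its mean sits `margin` below the
        # target arm's mean. The target arm is never corrupted.
        corruption = 0.0
        if arm != target:
            corruption = -(max(0.0, means[arm] - means[target]) + margin)
        cost += abs(corruption)

        counts[arm] += 1
        sums[arm] += reward + corruption  # learner observes corrupted reward

    return counts, cost


if __name__ == "__main__":
    counts, cost = run_attacked_ucb([0.9, 0.5, 0.1], target=2, horizon=5000)
    print("pulls per arm:", counts)
    print("total attack cost:", round(cost, 2))
```

Because this naive adversary pays a constant amount on every non-target pull, its cost grows with the number of such pulls, which is logarithmic in $T$ for UCB. It therefore does not achieve the $\widehat{O}(\sqrt{\log T})$ cost stated above; it only makes the learner, adversary, and cost accounting concrete.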