Thompson sampling is one of the most popular learning algorithms for online sequential decision-making and has rich real-world applications. However, existing Thompson sampling algorithms assume that the observed rewards are uncorrupted, which may not hold in real-world applications where adversarial reward poisoning exists. To make Thompson sampling more reliable, we aim to make it robust against adversarial reward poisoning. The main challenge is that the agent can only observe the rewards after corruption, so the actual posteriors for the true reward can no longer be computed. In this work, we solve this problem by computing pseudo-posteriors that are less likely to be manipulated by the attack. We propose robust algorithms based on Thompson sampling for the popular stochastic and contextual linear bandit settings, in both the case where the agent knows the attacker's budget and the case where it does not. We theoretically show that our algorithms guarantee near-optimal regret under any attack strategy.
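The pseudo-posterior idea can be illustrated on a simple Gaussian multi-armed bandit. The sketch below is only a minimal illustration of the general principle, not the paper's exact construction: assuming a known corruption budget `C`, each arm's posterior standard deviation is inflated by an extra `C / n` term (with `n` the arm's pull count), so that a bounded amount of reward poisoning cannot make the pseudo-posterior concentrate on the wrong arm too quickly. All function names and parameters here are hypothetical.

```python
import numpy as np

def robust_thompson_sampling(true_means, T=2000, C=5.0, sigma=1.0, seed=0):
    """Gaussian Thompson sampling with an inflated pseudo-posterior.

    Illustrative sketch only (not the paper's algorithm): C is an assumed
    known corruption budget, and each arm's posterior std is widened by
    C / n so that corrupted rewards of total magnitude at most C cannot
    shift the sampled means by more than the added slack.
    Returns the cumulative regret over T rounds.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)   # number of pulls per arm
    sums = np.zeros(K)     # sum of observed rewards per arm
    best = max(true_means)
    regret = 0.0
    for _ in range(T):
        n = np.maximum(counts, 1)
        mean = sums / n                              # empirical posterior mean
        std = sigma / np.sqrt(n) + C / n             # inflated (pseudo) posterior std
        sampled = rng.normal(mean, std)              # one sample per arm
        a = int(np.argmax(sampled))                  # play the best sampled arm
        reward = true_means[a] + sigma * rng.standard_normal()
        counts[a] += 1
        sums[a] += reward
        regret += best - true_means[a]
    return regret
```

With a large gap between arms, most pulls go to the optimal arm and the cumulative regret stays far below the linear worst case, even though the widened pseudo-posterior slows concentration compared to vanilla Thompson sampling.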