We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker can mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target arm in all but a sublinear number of rounds (that is, in T-o(T) of the T rounds), while incurring only sublinear (o(T)) cumulative attack cost. This result raises a critical security concern for real-world bandit-based systems: in online recommendation, for example, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the regret rate and are thus agnostic to the concrete bandit algorithm employed by the victim player. We also derive a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.