We study a strategic variant of the multi-armed bandit problem, which we call the strategic click-bandit. This model is motivated by applications in online recommendation, where the choice of recommended items depends on both the click-through rates and the post-click rewards. As in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer knows neither the post-click rewards nor the arms' actions (i.e., the strategically chosen click-rates) in advance, and must learn both over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a $\tilde{\mathcal{O}} (\sqrt{KT})$ regret bound that holds uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results with simulations of strategic arm behavior, which confirm the effectiveness and robustness of our proposed incentive design.
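The interaction model described in the abstract can be illustrated with a minimal simulation sketch. It is not UCB-S, whose details the abstract does not give; the learner here is a plain UCB1-style index, used only as an example of an incentive-unaware baseline. The function name `simulate_click_bandit`, the Bernoulli clicks and rewards, and the per-round payoff defined as click times reward are all assumptions made for illustration. Click-rates are passed in as fixed inputs, standing in for whatever each arm would choose strategically.

```python
import math
import random


def simulate_click_bandit(click_rates, reward_means, horizon, seed=0):
    """Simulate a click-bandit with a plain UCB1 learner (illustrative only).

    Assumptions (not from the paper): clicks and post-click rewards are
    Bernoulli, and the learner's per-round payoff is click * reward.
    """
    rng = random.Random(seed)
    k = len(click_rates)
    pulls = [0] * k          # times each arm was recommended
    clicks = [0] * k         # times each arm was clicked (the arm's utility)
    payoff_sum = [0.0] * k   # learner's accumulated payoff per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # recommend every arm once to initialize estimates
        else:
            # UCB1 index: empirical mean payoff plus exploration bonus
            arm = max(
                range(k),
                key=lambda i: payoff_sum[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        clicked = rng.random() < click_rates[arm]
        # A post-click reward is only realized if the item was clicked
        reward = 1.0 if clicked and rng.random() < reward_means[arm] else 0.0
        pulls[arm] += 1
        clicks[arm] += int(clicked)
        payoff_sum[arm] += reward

    return pulls, clicks, sum(payoff_sum)


if __name__ == "__main__":
    # Arm 0: high click-rate, low post-click reward; arm 1: the reverse.
    pulls, clicks, total = simulate_click_bandit([0.9, 0.5], [0.3, 0.8], 2000)
    print("recommendations:", pulls)
    print("clicks per arm:", clicks)
    print("learner payoff:", total)
```

Rerunning the sketch with different values in `click_rates` shows how an arm's click count changes with the click-rate it reports, which is the quantity each arm is assumed to maximize in the strategic setting.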