In bandit settings, optimizing long-term regret metrics requires exploration, which corresponds to sometimes taking myopically sub-optimal actions. When a long-lived principal merely recommends actions to be executed by a sequence of different agents (as in an online recommendation platform), this creates an incentive misalignment: exploration is "worth it" for the principal but not for the agents. Prior work studies regret minimization under the constraint of Bayesian incentive-compatibility in a static stochastic setting with a fixed, common prior shared by the agents and the algorithm designer. We show that (weighted) swap regret bounds on their own suffice to cause agents to faithfully follow forecasts in an approximate Bayes Nash equilibrium, even in dynamic environments in which agents have conflicting prior beliefs and the mechanism designer has no knowledge of any agent's beliefs. To obtain these bounds, it is necessary to assume that the agents have some degree of uncertainty not just about the rewards but also about their arrival time, i.e., their relative position in the sequence of agents served by the algorithm. We instantiate our abstract bounds with concrete algorithms that guarantee adaptive and weighted regret in bandit settings.
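For concreteness, here is one standard way to formalize the weighted swap regret notion referenced above (a sketch under textbook definitions; the exact weighting and normalization used in the paper may differ). For an algorithm selecting actions $a_t \in [K]$ over rounds $t = 1, \dots, T$ with per-round rewards $r_t : [K] \to [0,1]$ and nonnegative weights $w_t$, the weighted swap regret compares the realized reward against every fixed swap function $\pi$:
$$
\mathrm{SwapReg}_w(T) \;=\; \max_{\pi : [K] \to [K]} \;\sum_{t=1}^{T} w_t \big( r_t(\pi(a_t)) - r_t(a_t) \big),
$$
with ordinary swap regret recovered by taking $w_t \equiv 1$. Adaptive regret, the other guarantee mentioned, is typically defined as the worst-case regret over any contiguous interval of rounds rather than over the full horizon.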