We consider a variant of the stochastic multi-armed bandit (MAB) problem in which the arms are strategic agents who can either improve their rewards or absorb them. An agent's utility increases when she is pulled more often or absorbs more of her rewards, and decreases when she spends more effort improving her rewards. Agents are heterogeneous: they have different means and can improve their rewards up to different levels. Further, a non-empty subset of agents are "honest" and, in the worst case, always give their rewards without absorbing any part. The principal wishes to obtain high revenue (cumulative reward) by designing a mechanism that incentivizes top-level performance at equilibrium. At the same time, the principal wishes to be robust and obtain revenue at least at the level of the honest agent with the highest mean in the case of non-equilibrium behaviour. We identify a class of MAB algorithms satisfying a collection of properties, which we call performance incentivizing, and show that they lead to mechanisms that incentivize top-level performance at equilibrium and are robust under any strategy profile. Interestingly, we show that UCB is an example of such a MAB algorithm. Further, in the case where the top performance level is unknown, we show that combining second-price auction ideas with performance-incentivizing algorithms achieves performance at least at the second-highest level while also being robust.
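For readers unfamiliar with UCB, the following is a minimal sketch of the standard UCB1 index rule, which is the kind of MAB algorithm the abstract refers to. It is not the mechanism studied in this work: it contains no strategic behaviour, effort, or absorption. The Bernoulli reward model, the function names `ucb1_choose` and `run_ucb1`, and the parameter `delivered_means` (the mean reward each arm actually hands to the principal) are assumptions introduced purely for illustration.

```python
import math
import random


def ucb1_choose(counts, sums, t):
    """Return the arm with the largest UCB1 index at round t (1-based)."""
    # Pull every arm once before the confidence bound is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    def index(arm):
        mean = sums[arm] / counts[arm]                      # empirical mean
        bonus = math.sqrt(2.0 * math.log(t) / counts[arm])  # exploration bonus
        return mean + bonus

    return max(range(len(counts)), key=index)


def run_ucb1(delivered_means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms (an assumed, illustrative reward model).

    delivered_means[i] is a hypothetical stand-in for the mean reward that
    arm i passes on to the principal.
    """
    rng = random.Random(seed)
    k = len(delivered_means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total observed reward per arm
    revenue = 0.0      # principal's cumulative reward
    for t in range(1, horizon + 1):
        arm = ucb1_choose(counts, sums, t)
        reward = 1.0 if rng.random() < delivered_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        revenue += reward
    return counts, revenue


if __name__ == "__main__":
    counts, revenue = run_ucb1([0.2, 0.5, 0.9], horizon=2000)
    print("pulls per arm:", counts)
    print("revenue:", revenue)
```

In this sketch an arm that delivers a higher mean is pulled more often, which gives some intuition for why an agent whose utility grows with her number of pulls might be motivated to perform well under such a rule.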