In modern resource-sharing systems, multiple agents access limited resources under unknown stochastic conditions to perform tasks. When several agents access the same resource (arm) simultaneously, they compete for successful usage, which leads to contention and reduced rewards. This motivates our study of competitive multi-armed bandit (CMAB) games. In this paper, we study a new N-player K-arm competitive MAB game in which non-myopic players (agents) compete with each other and form diverse private estimates of the unknown arms over time. Possible collisions on the same arms and the time-varying nature of arm rewards make the policy analysis more involved than in existing studies of myopic players. We explicitly analyze the threshold-based structures of the social optimum and the existing selfish policy, showing that the latter incurs a prolonged convergence time of $\Omega(\frac{K}{\eta^2}\ln(\frac{KN}{\delta}))$, while the socially optimal policy with coordinated communication reduces it to $\mathcal{O}(\frac{K}{N\eta^2}\ln(\frac{K}{\delta}))$. Based on this comparison, we prove that competition among selfish players for the best arm can result in an infinite price of anarchy (PoA), indicating an arbitrarily large efficiency loss relative to the social optimum. We further prove that no informational (non-monetary) mechanism, including Bayesian persuasion, can reduce the infinite PoA, because strategic misreporting by non-myopic players undermines such approaches. To address this, we propose a Combined Informational and Side-Payment (CISP) mechanism, which provides socially optimal arm recommendations together with appropriate informational and monetary incentives to players according to their time-varying private beliefs. Our CISP mechanism is ex-post budget balanced for the social planner and ensures truthful reporting by players, achieving the minimum PoA of 1 and the same convergence time as the social optimum.
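To give a rough sense of the gap between the two convergence-time bounds, the following minimal sketch evaluates the expressions inside $\Omega(\cdot)$ and $\mathcal{O}(\cdot)$ with their hidden constants set to 1. That choice is an assumption made purely for illustration, so the outputs are not actual convergence times. The function names are hypothetical, and $\eta$ and $\delta$ are treated simply as positive numbers with $0 < \delta < 1$. Under these assumptions the ratio of the two expressions equals $N \ln(KN/\delta) / \ln(K/\delta)$, which is at least $N$.

```python
import math


def selfish_bound(K: int, N: int, eta: float, delta: float) -> float:
    """Expression inside the Omega(.) bound for the selfish policy.

    Assumption: the hidden constant is set to 1 for illustration only.
    """
    return K / eta**2 * math.log(K * N / delta)


def social_bound(K: int, N: int, eta: float, delta: float) -> float:
    """Expression inside the O(.) bound for the socially optimal policy.

    Assumption: the hidden constant is set to 1 for illustration only.
    """
    return K / (N * eta**2) * math.log(K / delta)


def bound_ratio(K: int, N: int, eta: float, delta: float) -> float:
    """Ratio of the two expressions; algebraically N * ln(KN/delta) / ln(K/delta)."""
    return selfish_bound(K, N, eta, delta) / social_bound(K, N, eta, delta)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show how the ratio scales with N.
    for n_players in (2, 5, 20):
        ratio = bound_ratio(K=10, N=n_players, eta=0.1, delta=0.05)
        print(f"N={n_players:>2}: ratio of bound expressions = {ratio:.2f}")
```

Because the true constants in the two bounds are unknown here, the ratio should be read only as an indication that the gap grows at least linearly in the number of players $N$, up to constant factors.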