This work studies multi-player \emph{unknown games}, in which each player must cope with uncertainty about the environment through bandit feedback while also making strategic decisions. We introduce Thompson Sampling (TS)-based algorithms that exploit information about opponents' actions and reward structures, substantially reducing experimental budgets and achieving more than tenfold improvements over conventional approaches. Notably, under specific reward structures, the regret bound of our algorithms depends only logarithmically on the size of the total action space, which significantly alleviates the curse of multi-player. We further propose the \emph{Optimism-then-NoRegret} (OTN) framework, which combines our advances with established algorithms, and we demonstrate its utility in practical scenarios such as traffic routing and radar sensing.