We study decentralized equilibrium selection in stochastic games under severe information and communication constraints. In such settings, convergence to equilibrium alone is insufficient: stochastic games typically admit many equilibria with markedly different welfare properties. We therefore address decentralized optimal equilibrium selection, in which agents coordinate on equilibria that optimize a designer-specified social-welfare objective while tolerating, to heterogeneous degrees, deviations from strict best responses. Agents observe only the global state trajectory and their own realized rewards, and exchange a single randomized bit of feedback per agent per round. This semantic content/discontent signaling mechanism implicitly aligns the decentralized learning dynamics with the global welfare objective. We develop explore-and-commit and online variants applicable to general stochastic games, accommodating heterogeneous model-based or model-free methods for solving the induced Markov decision processes, and we establish explicit finite-time regret guarantees, showing logarithmic expected regret under mild conditions.
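To make the one-bit feedback channel concrete, the following is a minimal toy sketch of content/discontent signaling. All names, thresholds, and the specific update rule here are illustrative assumptions, not the paper's actual mechanism: each agent compares its realized reward against a private aspiration level with a heterogeneous tolerance, and emits a single randomized bit; the joint policy is retained only when every agent reports "content".

```python
import random

def content_bit(reward, aspiration, tolerance, noise=0.05):
    """One randomized bit per agent per round (illustrative sketch).

    Returns 1 ("content") if the realized reward is within the agent's
    tolerance of its aspiration level, and flips the bit with small
    probability `noise` so the signal remains randomized.
    """
    content = reward >= aspiration - tolerance
    if random.random() < noise:
        content = not content
    return int(content)

def all_content(rewards, aspirations, tolerances):
    """Hypothetical commit test: keep the current joint policy only if
    every agent's (noiseless) bit comes back content."""
    return all(content_bit(r, a, t, noise=0.0)
               for r, a, t in zip(rewards, aspirations, tolerances))

# Example: two agents with heterogeneous tolerances.
all_content([1.0, 0.9], [1.0, 1.0], [0.0, 0.2])  # both content
all_content([0.5, 0.9], [1.0, 1.0], [0.1, 0.2])  # agent 0 discontent
```

In an explore-and-commit variant, such a test would gate the transition from exploration to commitment; the noise term is what makes the per-round feedback a *randomized* bit rather than a deterministic threshold report.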