This paper studies an online game in which, at each time step, a group of players selfishly and simultaneously minimize their own time-varying cost functions subject to time-varying coupled constraints and local feasible set constraints. Each player has access only to its local cost function and local constraints, and can share limited information with its neighbors over a fixed, connected graph. In addition, players have no prior knowledge of future cost functions or future local constraint functions. In this setting, a novel decentralized online learning algorithm is devised based on mirror descent and a primal-dual strategy. With appropriately chosen decaying stepsizes, the proposed algorithm achieves sublinear regret and sublinear constraint violation. Furthermore, it is shown that the sequence of play generated by the algorithm converges to the variational generalized Nash equilibrium (GNE) of the strongly monotone game to which the online game converges. Additionally, the payoff-based case, i.e., the bandit feedback setting, is considered, and a new payoff-based learning policy is devised that achieves sublinear regret and sublinear constraint violation. Finally, the theoretical results are corroborated by numerical simulations.
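To make the setting concrete, the following is a minimal sketch of one generic decentralized primal-dual update of the kind the abstract describes. It is not the paper's algorithm. Everything specific in it is an assumption made for illustration: the Euclidean mirror map (so the mirror descent step reduces to a projected gradient step), scalar actions in a box, quadratic costs, a single coupled constraint `sum(x) <= b_t` split evenly among players, a ring graph with weights from the hypothetical helper `ring_weights`, and stepsizes `1/sqrt(t)`. It also assumes each player can evaluate its own partial gradient at the joint action once the round's cost is revealed.

```python
import numpy as np


def ring_weights(n):
    """Doubly stochastic mixing matrix for a ring graph (illustrative choice)."""
    eye = np.eye(n)
    return 0.5 * eye + 0.25 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))


def primal_dual_step(x, lam, grad_cost, g_val, g_grad, W, alpha, beta, lo, hi):
    """One round of a generic decentralized primal-dual update.

    x, lam    : current actions and local multipliers, one entry per player
    grad_cost : each player's partial gradient of its own cost
    g_val     : each player's local share of the coupled constraint
    g_grad    : derivative of that share with respect to the player's action
    """
    # Each player averages its multiplier with those of its neighbors.
    lam_mix = W @ lam
    # Primal step: with the Euclidean mirror map, mirror descent is a
    # projected gradient step on the local Lagrangian.
    x_new = np.clip(x - alpha * (grad_cost + lam_mix * g_grad), lo, hi)
    # Dual step: ascent on the constraint share, projected onto lam >= 0.
    lam_new = np.maximum(0.0, lam_mix + beta * g_val)
    return x_new, lam_new


def run(T=200, n=4, seed=0):
    """Simulate T rounds on assumed time-varying quadratic costs."""
    rng = np.random.default_rng(seed)
    W = ring_weights(n)
    x, lam = np.zeros(n), np.zeros(n)
    lo, hi = 0.0, 2.0
    total_violation = 0.0
    for t in range(1, T + 1):
        # The round's cost and constraint data are revealed after play.
        a_t = 1.0 + 0.1 * rng.standard_normal(n) / t
        b_t = 3.0 + 1.0 / t
        # Accumulate the coupled-constraint residual at the played action.
        total_violation += x.sum() - b_t
        # Assumed cost: 0.5*(x_i - a_i)^2 + 0.1*x_i*sum_{j != i} x_j
        grad_cost = (x - a_t) + 0.1 * (x.sum() - x)
        g_val = x - b_t / n
        step = 1.0 / np.sqrt(t)  # decaying stepsize (assumed schedule)
        x, lam = primal_dual_step(
            x, lam, grad_cost, g_val, np.ones(n), W, step, step, lo, hi
        )
    return x, lam, max(0.0, total_violation)


if __name__ == "__main__":
    x, lam, violation = run()
    print("final actions:", x)
    print("final multipliers:", lam)
    print("cumulative constraint violation:", violation)
```

The clipping step keeps every action inside its local feasible set, while the multipliers only penalize the coupled constraint over time, which is why violation is tracked cumulatively and not enforced per round.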