We consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraint functions. Such games arise in applications where each agent's action must belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents' policies are a function of the unknown dynamics and hence are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach that exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm, referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds and that the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $\epsilon$-c.z.CCEs can be approached whenever players follow a no-regret, no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.