We consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraint functions. Such games arise in applications where each agent's action must belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents' policies are a function of the unknown dynamics and hence are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach that exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm, referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds and that the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $\epsilon$-c.z.CCEs can be approached whenever players follow a no-regret, no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.
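As a rough illustration of how the two violation statements above fit together, write $v_t \ge 0$ for the constraint violation incurred in round $t$ and $V_T$ for its cumulative sum over $T$ rounds. These symbols are our own shorthand, assumed here for illustration, and are not notation taken from the paper. A sublinear bound on the cumulative violation then gives the no-violation property directly:

$$
V_T = \sum_{t=1}^{T} v_t = o(T)
\quad\Longrightarrow\quad
\frac{V_T}{T} \to 0 \ \text{ as } T \to \infty .
$$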