We introduce a learning problem in a generalized two-sided matching market, where agents select actions to interact with their match. Specifically, we consider a setting in which matched agents engage in zero-sum games with initially unknown payoff matrices, and we investigate whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of a \emph{matching equilibrium}, where a matching \( \mathfrak{m} \) and a set of agent strategies \( X \) form an equilibrium if no agent has an incentive to deviate from \( (\mathfrak{m}, X) \). To quantify deviations of a candidate solution \( (\mathfrak{m}, X) \) from the equilibrium \( (\mathfrak{m}^\star, X^\star) \), we introduce the notion of \emph{matching instability}, which serves as a regret measure for the learning problem. We propose a UCB-based algorithm in which agents form preferences and select actions according to optimistic estimates of the payoffs. Our analysis establishes a sublinear, instance-independent regret upper bound, further supported by empirical evidence.
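To make the idea of acting on optimistic payoff estimates concrete, the following is a minimal sketch and not the algorithm analyzed in this work. It assumes an entrywise UCB bonus of the form \( \sqrt{2 \log t / n} \), and it replaces the equilibrium computation with a pure-strategy maximin to stay dependency-free. All names (`OptimisticPayoffEstimate`, `pure_maximin`, `rank_partners`) are hypothetical and chosen for illustration only.

```python
import math
import random


class OptimisticPayoffEstimate:
    """Entrywise UCB estimate of an unknown payoff matrix (row player's payoffs).

    Assumption: the bonus sqrt(2 * log(t) / n) is a generic UCB-style choice,
    not a constant taken from the source.
    """

    def __init__(self, n_rows, n_cols):
        self.counts = [[0] * n_cols for _ in range(n_rows)]
        self.means = [[0.0] * n_cols for _ in range(n_rows)]
        self.t = 0  # total number of payoff observations so far

    def update(self, i, j, payoff):
        """Record one noisy payoff observed when the action pair (i, j) was played."""
        self.t += 1
        self.counts[i][j] += 1
        n = self.counts[i][j]
        # Incremental update of the empirical mean for entry (i, j).
        self.means[i][j] += (payoff - self.means[i][j]) / n

    def ucb(self, i, j):
        """Empirical mean plus confidence bonus; unplayed entries are infinitely optimistic."""
        n = self.counts[i][j]
        if n == 0:
            return float("inf")
        return self.means[i][j] + math.sqrt(2.0 * math.log(max(self.t, 2)) / n)

    def optimistic_matrix(self):
        return [[self.ucb(i, j) for j in range(len(row))]
                for i, row in enumerate(self.means)]


def pure_maximin(matrix):
    """Return (row, value) for the row that maximizes its worst-case payoff.

    Simplification: pure strategies only, as a stand-in for a mixed equilibrium.
    """
    worst_case = [min(row) for row in matrix]
    best = max(range(len(matrix)), key=lambda i: worst_case[i])
    return best, worst_case[best]


def rank_partners(estimates):
    """Order candidate partners by optimistic worst-case value, best first.

    `estimates` maps a partner label to an OptimisticPayoffEstimate.
    """
    def value(partner):
        return pure_maximin(estimates[partner].optimistic_matrix())[1]

    return sorted(estimates, key=value, reverse=True)


if __name__ == "__main__":
    random.seed(0)
    true_payoffs = [[0.5, -0.5], [0.0, 0.2]]  # hypothetical game, unknown to the learner
    est = OptimisticPayoffEstimate(2, 2)
    for _ in range(200):
        i, j = random.randrange(2), random.randrange(2)
        est.update(i, j, true_payoffs[i][j] + random.gauss(0.0, 0.1))
    print(pure_maximin(est.optimistic_matrix()))
```

A fuller treatment would compute a mixed-strategy equilibrium of the optimistic matrix, which for a zero-sum matrix game can be done by linear programming. The pure maximin above only illustrates how optimism enters both action selection and preference formation.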