Player-optimal Stable Regret for Bandit Learning in Matching Markets

Matching markets have long been studied because of their wide range of applications, and finding a stable matching is the standard equilibrium objective. Since market participants are usually uncertain of their preferences, a rich line of recent work studies the online setting in which participants on one side (players) learn their unknown preferences through repeated interactions with the other side (arms). Most previous work in this line derives theoretical guarantees only for the player-pessimal stable regret, which is measured against the players' least-preferred stable matching. Under the pessimal stable matching, however, players obtain the lowest reward among all stable matchings. To maximize players' rewards, the player-optimal stable matching is the most desirable benchmark. Although \citet{basu21beyond} establish an upper bound on the player-optimal stable regret, their bound can be exponentially large when the players' preference gap is small. Whether a polynomial guarantee for this regret exists is a significant open problem. In this work, we propose a new algorithm, explore-then-Gale-Shapley (ETGS), and show that the player-optimal stable regret of each player is upper bounded by $O(K\log T/\Delta^2)$, where $K$ is the number of arms, $T$ is the horizon, and $\Delta$ is the players' minimum preference gap among their top $N+1$ ranked arms, with $N$ the number of players. This result significantly improves on previous work, which either targets the weaker player-pessimal stable matching objective or applies only to markets satisfying special assumptions. When the participants' preferences satisfy certain special conditions, our regret upper bound also matches the previously derived lower bound.
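To make the gap between the player-optimal and player-pessimal benchmarks concrete, the following minimal sketch runs the classical offline Gale-Shapley (deferred acceptance) procedure on a toy market whose preferences are fully known. It is not the ETGS algorithm itself, which must additionally learn the players' preferences from noisy feedback. The function name `gale_shapley` and the dictionaries `player_prefs` and `arm_prefs` are hypothetical names chosen for illustration, and the sketch assumes complete preference lists with no more proposers than receivers.

```python
from collections import deque


def gale_shapley(proposer_prefs, receiver_prefs):
    """Deferred acceptance with known preferences (illustrative sketch).

    Returns a dict mapping each proposer to its matched receiver. The result
    is the stable matching that is optimal for the proposing side.
    Assumes complete preference lists and len(proposers) <= len(receivers).
    """
    # rank[r][p]: position of proposer p in receiver r's list (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to propose to
    holder = {}                                   # receiver -> proposer it holds
    free = deque(proposer_prefs)                  # currently unmatched proposers

    while free:
        p = free.popleft()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = holder.get(r)
        if current is None:
            holder[r] = p                         # receiver was unmatched
        elif rank[r][p] < rank[r][current]:
            holder[r] = p                         # receiver trades up
            free.append(current)
        else:
            free.append(p)                        # proposal rejected

    return {p: r for r, p in holder.items()}


# Toy market with two distinct stable matchings (hypothetical preferences).
player_prefs = {"p1": ["a1", "a2"], "p2": ["a2", "a1"]}
arm_prefs = {"a1": ["p2", "p1"], "a2": ["p1", "p2"]}

# Players propose: every player gets its best stable partner.
player_optimal = gale_shapley(player_prefs, arm_prefs)

# Arms propose: every player gets its worst stable partner.
arm_side = gale_shapley(arm_prefs, player_prefs)
player_pessimal = {p: a for a, p in arm_side.items()}
```

In this toy market each player receives its first choice under `player_optimal` and its last choice under `player_pessimal`, even though both matchings are stable. A regret guarantee measured against the pessimal matching therefore says nothing about the reward lost relative to the optimal one.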
