We study bandit learning in matching markets, where players and arms constitute the two sides of the market and the players' utilities are linear in the arm contexts. In each round, new arms arrive with observable contexts. The algorithm then matches them to players, aiming to minimize each player's regret against a stable matching benchmark. This contextual structure creates significant complexity: a subtle context shift can slightly alter one player's utility while completely reconfiguring the underlying benchmark, causing large regret spikes for others. We address this in two settings: stochastic contexts, drawn from a latent distribution, and adversarial contexts, which may be arbitrary. For the stochastic case, we introduce a novel minimum preference gap to capture learning difficulty and provide a fully adaptive algorithm with an instance-dependent poly-logarithmic regret upper bound. We also establish matching instance-independent regret upper and lower bounds under a mild distributional assumption. For the adversarial setting, we propose a tractable regret notion that remains valid under arbitrary contexts and achieve an instance-independent sublinear regret bound via an adaptive algorithm.
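The sensitivity of the benchmark to small context shifts can be seen in a toy instance. The sketch below is illustrative only and is not the algorithm or benchmark construction of this work: it assumes that arms hold fixed, known preferences over players (the abstract does not specify this), it computes a stable matching with standard player-proposing deferred acceptance, and all names and numbers (`THETA`, `ARM_RANK`, the contexts) are hypothetical.

```python
def utility(theta, context):
    """Linear utility: inner product of a player's parameter and an arm's context."""
    return sum(t * x for t, x in zip(theta, context))


def stable_matching(utilities, arm_rank):
    """Player-proposing deferred acceptance (Gale-Shapley).

    utilities[i][k]: utility of player i for arm k.
    arm_rank[k][i]:  rank of player i in arm k's preference list (lower is better).
    Returns a dict mapping each matched player to an arm.
    """
    n_players = len(utilities)
    # Each player's arms, sorted from most to least preferred.
    prefs = [sorted(range(len(u)), key=lambda k: -u[k]) for u in utilities]
    next_choice = [0] * n_players
    holder = {}  # arm -> player currently held by that arm
    free = list(range(n_players))
    while free:
        i = free.pop()
        if next_choice[i] >= len(prefs[i]):
            continue  # player i has been rejected by every arm
        k = prefs[i][next_choice[i]]
        next_choice[i] += 1
        j = holder.get(k)
        if j is None:
            holder[k] = i
        elif arm_rank[k][i] < arm_rank[k][j]:
            holder[k] = i      # arm k prefers the new proposer
            free.append(j)     # the displaced player proposes again
        else:
            free.append(i)     # proposal rejected
    return {i: k for k, i in holder.items()}


# Hypothetical instance: two players, two arms, two-dimensional contexts.
THETA = [(1.0, 0.0), (0.0, 1.0)]   # assumed player parameters
ARM_RANK = [[0, 1], [0, 1]]        # assumption: both arms prefer player 0


def benchmark(contexts):
    """Return the stable matching and each player's utility under it."""
    utils = [[utility(th, x) for x in contexts] for th in THETA]
    match = stable_matching(utils, ARM_RANK)
    return match, [utils[i][match[i]] for i in range(len(THETA))]


# Shift the first coordinate of arm 0's context from 1.00 to 0.98.
before, u_before = benchmark([(1.00, 0.0), (0.99, 1.0)])
after, u_after = benchmark([(0.98, 0.0), (0.99, 1.0)])

print("before:", before, u_before)
print("after: ", after, u_after)
```

In this instance the shift moves player 0's utility under the stable matching only from 1.00 to 0.99, but it swaps the two assignments, so player 1's utility under the matching falls from 1.0 to 0.0.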