Sequential learning in multi-agent, resource-constrained matching markets has received significant interest in recent years. We study decentralized learning in two-sided matching markets where the demand side (the players, or agents) competes for a `large' supply side (the arms), with potentially time-varying preferences, to obtain a stable match. Despite a long line of recent work, existing learning algorithms such as Explore-Then-Commit and Upper-Confidence-Bound remain inefficient for this problem: the per-agent regret they achieve scales linearly with the number of arms, $K$. Motivated by the linear contextual bandit framework, we assume that for each agent the mean reward of an arm can be represented as a linear function of a known feature vector and an unknown (agent-specific) parameter. Moreover, our setup captures the essence of a dynamic (non-stationary) matching market in which preferences over arms change over time. Our proposed algorithms achieve instance-dependent logarithmic regret that scales independently of the number of arms, $K$.
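The linear reward assumption above can be illustrated with a minimal sketch (all names and dimensions here are illustrative, not taken from the paper): each arm $k$ has a known feature vector $x_k \in \mathbb{R}^d$, each agent $i$ has an unknown parameter $\theta_i \in \mathbb{R}^d$, and the arm-mean is the inner product $\langle x_k, \theta_i \rangle$. Estimating the $d$-dimensional $\theta_i$ then suffices to rank all $K$ arms, which is why regret need not grow with $K$.

```python
import numpy as np

# Hypothetical setup: arm features are known to all agents; each agent's
# parameter theta_i is unknown and must be learned from bandit feedback.
rng = np.random.default_rng(0)
d, K, n_agents = 3, 100, 2

X = rng.standard_normal((K, d))             # known arm feature vectors x_k
theta = rng.standard_normal((n_agents, d))  # unknown agent parameters theta_i

# Arm-mean for (arm k, agent i) is the linear function <x_k, theta_i>.
means = X @ theta.T                         # shape (K, n_agents)

# Agent 0's preference order over all K arms is induced by its column.
pref_order = np.argsort(-means[:, 0])
best_arm = pref_order[0]
```

Note that only $d$ parameters per agent are estimated, even though the ranking covers all $K = 100$ arms.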