We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon $T$. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to $\Theta(\log\log T)$ over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$.
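To make the two mechanisms in the abstract concrete, here is a minimal sketch of (i) an arm sampling one agent from its assigned pool under an MNL choice model, and (ii) a geometric batch grid with $\Theta(\log\log T)$ update points of the form $t_i = \lceil T^{1-2^{-i}} \rceil$, a standard construction in batched bandits. The utility values, the outside-option weight of $1$, and the exact grid formula are illustrative assumptions, not the paper's specific algorithm.

```python
import math
import random

def mnl_choice(utilities):
    """Sample one agent index from an arm's assigned pool under an MNL model.

    `utilities` are hypothetical preference weights v_j > 0 (unknown to the
    learner in the paper's setting). Returning None models the arm matching
    no agent, with the outside option's weight normalized to 1.
    """
    denom = 1.0 + sum(utilities)          # MNL normalizer: 1 + sum_j v_j
    r = random.random() * denom
    acc = 1.0                             # outside-option probability mass
    if r < acc:
        return None                       # no match this round
    for j, v in enumerate(utilities):
        acc += v
        if r < acc:
            return j
    return len(utilities) - 1             # guard against float round-off

def batch_grid(T):
    """Rounds at which assignments are re-optimized: Theta(log log T) batches.

    Uses the doubling-style grid t_i = ceil(T^(1 - 2^{-i})), so expensive
    combinatorial optimization runs only at these O(log log T) rounds.
    """
    M = max(1, math.ceil(math.log2(math.log2(T))))   # number of batches
    grid = [math.ceil(T ** (1.0 - 2.0 ** (-i))) for i in range(1, M)]
    grid.append(T)                                   # final batch ends at T
    return sorted(set(grid))
```

For example, `batch_grid(10000)` yields only 4 update rounds over a horizon of 10,000, which is what makes the per-round combinatorial cost vanish in amortized terms.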