We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences, a well-established structure in social choice theory in which users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem and build on it to obtain an efficient online algorithm with regret $\tilde O(UKT^{2/3})$. This approach relies on a novel PQ-tree-based order approximation method. When the single-peaked structure is known in advance, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
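To make the structural assumption concrete, the following is a minimal sketch of what "unimodal with respect to a common order over arms" could mean for a table of expected rewards. It is an illustration only, not the algorithm described above: the function names `is_single_peaked` and `all_single_peaked`, the representation of rewards as one list per user, and the use of weak (non-strict) unimodality are all assumptions made here for the example.

```python
from typing import Sequence


def is_single_peaked(rewards: Sequence[float], order: Sequence[int]) -> bool:
    """Check one user's rewards (hypothetical helper, not from the paper).

    Returns True if the rewards, read along `order`, weakly increase up to
    some peak and weakly decrease afterwards.
    """
    values = [rewards[arm] for arm in order]
    n = len(values)
    i = 0
    # Climb while the values do not decrease.
    while i + 1 < n and values[i] <= values[i + 1]:
        i += 1
    # Descend while the values do not increase.
    while i + 1 < n and values[i] >= values[i + 1]:
        i += 1
    # Single-peaked exactly when the climb and descent cover the whole order.
    return i >= n - 1


def all_single_peaked(
    reward_matrix: Sequence[Sequence[float]], order: Sequence[int]
) -> bool:
    """Check that every user (row) is single-peaked w.r.t. the same order."""
    return all(is_single_peaked(row, order) for row in reward_matrix)
```

With two users whose expected rewards over four arms are `[0.1, 0.5, 0.9, 0.4]` and `[0.8, 0.6, 0.3, 0.2]`, the order `[0, 1, 2, 3]` passes this check, while the order `[2, 0, 1, 3]` does not, because the first user's rewards would then fall and rise again.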