In this paper, we investigate the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. There has been a significant discrepancy between existing lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. Additionally, differences in the reward structures assumed by these bounds complicate the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality, under either uniform or non-uniform rewards, and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
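For concreteness, the feedback model can be illustrated with a minimal sketch. It assumes the standard MNL parameterization with an outside (no-purchase) option of utility zero, which the abstract does not spell out; the function names `mnl_choice_probs` and `expected_reward` are hypothetical and are not part of OFU-MNL+.

```python
import math

def mnl_choice_probs(utilities):
    """Choice probabilities for one assortment under a standard MNL model.

    `utilities` holds one scalar utility per offered item (for instance an
    inner product of item features and a parameter vector; this is an assumption).
    Returns (p_outside, [p_item, ...]); the outside option has utility 0.
    """
    weights = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(weights)  # the 1.0 is exp(0) for the outside option
    return 1.0 / denom, [w / denom for w in weights]

def expected_reward(utilities, rewards):
    """Expected reward of an assortment: sum of reward times choice probability."""
    _, probs = mnl_choice_probs(utilities)
    return sum(p * r for p, r in zip(probs, rewards))

# Two items with equal utility and uniform reward 1.
p_out, probs = mnl_choice_probs([0.0, 0.0])
value = expected_reward([0.0, 0.0], [1.0, 1.0])
```

With uniform rewards equal to 1, the expected reward in this sketch is simply one minus the no-purchase probability, so it depends on the assortment only through the total weight of the offered items.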