In this paper, we study the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. There has been a significant gap between the known lower and upper regret bounds, particularly in their dependence on the maximum assortment size $K$. Moreover, the differing reward structures assumed by these bounds further complicate the pursuit of optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose OFU-MNL+, a constant-time algorithm that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achieved by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality, for either the uniform or non-uniform reward setting, and to propose a computationally efficient algorithm that attains this optimality up to logarithmic factors.
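For concreteness, the MNL choice model underlying the user feedback above can be written in its standard form. The notation below ($x_{ti}$ for item features and $\theta^*$ for the unknown parameter) is chosen here for illustration and may differ from the paper's own symbols: given an offered assortment $S_t$ with $|S_t| \le K$, the user selects item $i \in S_t$, or the outside (no-purchase) option $0$ whose utility is normalized to zero, with probability
\[
\mathbb{P}(i \mid S_t) = \frac{\exp(x_{ti}^\top \theta^*)}{1 + \sum_{j \in S_t} \exp(x_{tj}^\top \theta^*)},
\qquad
\mathbb{P}(0 \mid S_t) = \frac{1}{1 + \sum_{j \in S_t} \exp(x_{tj}^\top \theta^*)}.
\]
The uniform-reward setting assigns every item the same expected reward, so the agent effectively maximizes the purchase probability, whereas the non-uniform setting weights each item's choice probability by its own reward.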