In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
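As a concrete illustration of the feedback model in the abstract, the following sketch computes MNL choice probabilities for an assortment of $K$ items with $d$-dimensional features, including the standard "no purchase" outside option with utility zero. All names here are illustrative assumptions, not notation or code from the paper.

```python
import numpy as np

def mnl_choice_probs(theta, X):
    """Choice probabilities under a multinomial logit (MNL) model.

    theta : (d,) parameter vector.
    X     : (K, d) feature matrix for the offered assortment.

    Returns a length-(K+1) probability vector whose first entry is the
    'no purchase' outside option (utility 0, so exp(0) = 1 in the
    denominator). Illustrative sketch only.
    """
    utilities = X @ theta              # v_i = x_i^T theta for each item i
    expu = np.exp(utilities)
    denom = 1.0 + expu.sum()           # outside option contributes 1
    return np.concatenate(([1.0 / denom], expu / denom))

# Example: an assortment of K = 4 items with d = 3 features.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
X = rng.normal(size=(4, 3))
p = mnl_choice_probs(theta, X)         # probabilities sum to 1
```

In a contextual MNL bandit, the agent observes such feature matrices each round, picks an assortment, and sees a single choice drawn from these probabilities; the regret bounds in the abstract quantify how fast an algorithm can learn $\theta$ from that feedback.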