On Pareto Optimality for Parametric Choice Bandits

We study online assortment optimization under stochastic choice when a decision maker simultaneously values cumulative revenue performance and the quality of post-hoc inference on revenue contrasts. We analyze a forced-exploration optimism-in-the-face-of-uncertainty (OFU) scheme that combines two regularized maximum-likelihood estimators: one based on all observations for sequential decision making, and one based only on exploration rounds for inference. Our general theory is developed under predictable score proxies and per-round action-dependent curvature domination. Under these conditions we establish a self-normalized concentration inequality, a likelihood-based ellipsoidal confidence-set theorem, and a regret bound for approximate optimistic actions that explicitly accounts for optimization error. For the multinomial logit (MNL) model we derive explicit score and curvature proxies and show that a balanced spaced singleton-exploration schedule yields realized coordinate coverage, implying regret $\Otilde(n_T + T/\sqrt{n_T})$ and revenue-contrast error $\Otilde(1/\sqrt{n_T})$ up to fixed problem-dependent factors. A hard two-assortment subclass yields a matching lower bound at the product level. Consequently, within the polynomial exploration family $n_T \asymp T^{\alpha}$, the regret and inference rates become $\Otilde(T^{\max\{\alpha,1-\alpha/2\}})$ and $\Otilde(T^{-\alpha/2})$, respectively; hence $\alpha\in[2/3,1)$ is the rate-wise Pareto-undominated interval and $\alpha=2/3$ is the unique balancing point that minimizes the regret exponent. Finally, for the Exponomial Choice and Nested Logit models we state verifiable sufficient conditions that would instantiate the general framework.
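
As a quick sanity check on the exponents stated above, the following is a sketch that ignores logarithmic and problem-dependent factors. The symbols $r(\alpha)$ and $e(\alpha)$ are introduced here only for illustration and are not notation from the paper. Substituting $n_T = T^{\alpha}$ with $\alpha\in(0,1)$ into the two rates gives
\[
n_T + \frac{T}{\sqrt{n_T}} = T^{\alpha} + T^{1-\alpha/2}
\quad\Longrightarrow\quad
r(\alpha) = \max\{\alpha,\, 1-\alpha/2\},
\qquad
\frac{1}{\sqrt{n_T}} = T^{-\alpha/2}
\quad\Longrightarrow\quad
e(\alpha) = -\alpha/2 .
\]
The two regret terms balance when
\[
\alpha = 1 - \frac{\alpha}{2}
\quad\Longleftrightarrow\quad
\alpha = \frac{2}{3},
\qquad\text{so that}\qquad
r(2/3) = \frac{2}{3},
\quad
e(2/3) = -\frac{1}{3}.
\]
For $\alpha < 2/3$ we have $r(\alpha) = 1-\alpha/2$, so increasing $\alpha$ lowers both $r$ and $e$, and every such choice is dominated by $\alpha = 2/3$. For $\alpha \ge 2/3$ we have $r(\alpha) = \alpha$, which increases while $e(\alpha)$ keeps decreasing, so no choice in $[2/3,1)$ dominates another. This is consistent with the undominated interval and the balancing point described above.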
