We address the problem of active online assortment optimization with preference feedback, a framework for modeling user choices and subsetwise utility maximization. The framework is useful in many real-world applications, including ad placement, online retail, recommender systems, and fine-tuning language models. Although the problem has been studied in the past, it lacks an intuitive and practical solution that is simultaneously computationally efficient and has an optimal regret guarantee. For example, popular assortment selection algorithms often require a `strong reference' item that is always included in the choice sets; they are also designed to offer the same assortment repeatedly until the reference item is selected. Such requirements are unrealistic in practical applications. In this paper, we design efficient algorithms for regret minimization in assortment selection with \emph{Plackett-Luce} (PL) based user choices. We derive a novel concentration guarantee for estimating the score parameters of the PL model using `\emph{Pairwise Rank-Breaking}', which forms the foundation of our proposed algorithms. Our methods are practical, provably optimal, and free of the aforementioned limitations of existing methods. Empirical evaluations corroborate our findings and show that our algorithms outperform the existing baselines.
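To make the idea of pairwise rank-breaking concrete, the following is a minimal sketch and not the paper's algorithm. It assumes winner-only feedback, in which the user picks one item from each offered assortment under a PL model, and it relies on the standard PL property that, among rounds where either item $i$ or item $j$ wins, $i$ wins with probability $\theta_i / (\theta_i + \theta_j)$. The score vector `theta`, the uniformly random assortments, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

def sample_pl_winner(theta, subset, rng):
    """Draw one winner from `subset` under a Plackett-Luce choice model."""
    subset = np.asarray(subset)
    probs = theta[subset] / theta[subset].sum()
    return int(rng.choice(subset, p=probs))

def rank_break_update(wins, subset, winner):
    """Break one winner observation into pairwise wins over each other offered item."""
    for j in subset:
        if j != winner:
            wins[winner, j] += 1

def pairwise_estimates(wins):
    """Empirical P(i beats j); pairs with no comparisons default to 0.5."""
    total = wins + wins.T
    return np.where(total > 0, wins / np.maximum(total, 1), 0.5)

rng = np.random.default_rng(0)
theta = np.array([1.0, 0.8, 0.5, 0.3, 0.2])  # hypothetical PL scores (assumption)
n_items, set_size = len(theta), 3
wins = np.zeros((n_items, n_items))

for _ in range(20000):
    # Assortments are drawn uniformly at random here; a regret-minimizing
    # algorithm would instead choose them adaptively.
    subset = rng.choice(n_items, size=set_size, replace=False)
    winner = sample_pl_winner(theta, subset, rng)
    rank_break_update(wins, subset, winner)

p_hat = pairwise_estimates(wins)
p_true = theta[:, None] / (theta[:, None] + theta[None, :])
max_err = float(np.abs(p_hat - p_true).max())
print("largest pairwise estimation error:", max_err)
```

In this sketch no reference item is required and no assortment needs to be repeated: every offered set contributes pairwise comparisons between the winner and the other offered items.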