Multi-armed bandit (MAB) algorithms are efficient approaches to reducing the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start problem at the onset of the experiment: lacking knowledge of customer preferences for new products, they require an initial data collection phase known as the burn-in period. During this period, standard MAB algorithms operate like randomized experiments, incurring large burn-in costs that scale with the number of products. We reduce the burn-in by observing that many products can be cast as two-sided products, whose rewards are naturally modeled by a matrix whose rows and columns represent the two sides, respectively. We then design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products, and then apply a UCB procedure on this targeted set to find the best one. We theoretically show that the proposed algorithms lower costs and expedite the experiment when experimentation time is limited and the product set is large. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on the dimensions of the matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates the superior performance of the proposed algorithms.
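The two-phase idea above can be illustrated with a minimal sketch. The setup below is purely hypothetical (a small rank-1 Bernoulli reward matrix, a uniform subsampling budget, a rank-1 truncated SVD as a simple stand-in for low-rank matrix estimation, and plain UCB1 in the second phase); the paper's actual algorithms, constants, and phase lengths may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: products are pairs (i, j) of the two sides, and the
# mean-reward matrix is rank-1. All parameter choices here are illustrative.
n_rows, n_cols, horizon = 8, 8, 4000
u = rng.uniform(0.2, 0.9, n_rows)
v = rng.uniform(0.2, 0.9, n_cols)
means = np.outer(u, v)  # true mean reward of product (i, j)

# Phase 1: subsample entries uniformly at random during the burn-in, then
# estimate the matrix with a rank-1 truncated SVD of the empirical means.
burn_in = horizon // 4
counts = np.zeros((n_rows, n_cols))
sums = np.zeros((n_rows, n_cols))
for _ in range(burn_in):
    i, j = rng.integers(n_rows), rng.integers(n_cols)
    counts[i, j] += 1
    sums[i, j] += rng.binomial(1, means[i, j])  # Bernoulli reward
est = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
U, s, Vt = np.linalg.svd(est)
low_rank = s[0] * np.outer(U[:, 0], Vt[0, :])

# Keep a substantially smaller targeted set: the k products with the
# highest estimated mean rewards.
k = 4
top = np.argsort(low_rank.ravel())[::-1][:k]
targets = [divmod(idx, n_cols) for idx in top]

# Phase 2: run UCB1 on the targeted set only.
t_counts = np.zeros(k)
t_sums = np.zeros(k)
for t in range(1, horizon - burn_in + 1):
    if t <= k:
        a = t - 1  # play each targeted arm once to initialize
    else:
        ucb = t_sums / t_counts + np.sqrt(2 * np.log(t) / t_counts)
        a = int(np.argmax(ucb))
    i, j = targets[a]
    t_counts[a] += 1
    t_sums[a] += rng.binomial(1, means[i, j])

best = targets[int(np.argmax(t_sums / t_counts))]
print("chosen product:", best)
print("true best product:", np.unravel_index(means.argmax(), means.shape))
```

The point of the sketch is the cost structure: the burn-in explores only a sample of the (n_rows x n_cols) entries and relies on the low-rank structure to fill in the rest, so the second-phase UCB runs over k products instead of all of them.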