We consider the best arm identification (BAI) problem in the stochastic multi-armed bandit framework, where each arm has a tiny probability of realizing a large reward and, with overwhelming probability, yields zero reward. A key application of this framework is online advertising, where click rates of advertisements may be a fraction of a percent, and final conversion to sales, while highly profitable, may in turn be a small fraction of the click rate. Recently, algorithms for BAI problems have been developed that minimize sample complexity while providing statistical guarantees on correct arm selection. As we observe, these algorithms can be computationally prohibitive. We exploit the fact that the reward process of each arm is well approximated by a compound Poisson process to arrive at algorithms that are faster, at the cost of a small increase in sample complexity. We analyze the problem in an asymptotic regime in which the probability of reward occurrence decreases to zero and the reward amounts increase to infinity. This helps illustrate the benefits of the proposed algorithms and sheds light on the underlying structure of optimal BAI algorithms in the rare event setting.
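To make the compound Poisson approximation concrete, the following is a minimal sketch and not the construction used in the paper. It assumes each pull of an arm yields a nonzero reward independently with probability `p`, so the number of nonzero rewards in `n` pulls is Binomial(`n`, `p`), and it compares that count with a Poisson variable of mean `n * p`. The function names and the values of `n`, `p`, and `k_max` are hypothetical choices made for illustration.

```python
import math


def log_binom_pmf(n, p, k):
    """log P(Binomial(n, p) = k), computed in log space to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log_poisson_pmf(lam, k):
    """log P(Poisson(lam) = k)."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def tv_upper_bound(n, p, k_max=200):
    """Upper bound on the total variation distance between Binomial(n, p)
    and Poisson(n * p): exact absolute differences up to k_max, plus the
    leftover mass of both distributions beyond k_max."""
    lam = n * p
    k_max = min(k_max, n)
    abs_diff = mass_b = mass_q = 0.0
    for k in range(k_max + 1):
        b = math.exp(log_binom_pmf(n, p, k))
        q = math.exp(log_poisson_pmf(lam, k))
        abs_diff += abs(b - q)
        mass_b += b
        mass_q += q
    # Mass not covered by the truncated sum (clamped against rounding error).
    tail = max(0.0, 1.0 - mass_b) + max(0.0, 1.0 - mass_q)
    return 0.5 * (abs_diff + tail)


# Assumed illustrative values: 10,000 pulls, 0.1% chance of a nonzero reward.
n, p = 10_000, 0.001
tv = tv_upper_bound(n, p)
le_cam = n * p * p  # Le Cam's inequality: TV distance is at most n * p^2
print(f"TV bound from pmfs: {tv:.2e}, Le Cam bound: {le_cam:.2e}")
```

Under these assumptions the count of nonzero rewards is nearly indistinguishable from a Poisson count, and Le Cam's inequality shows the gap shrinks as `p` decreases with `n * p` held fixed. Attaching an independent reward amount to each event then gives a compound Poisson sum, which is the sense of the approximation sketched here.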