We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To this end, we design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm. Complementarily, by establishing lower bounds on the regret achievable by any algorithm with a given BAI failure probability, we show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives, and (ii) BoBW-lil'UCB$(\gamma)$ achieves order-wise optimal performance for RM or BAI under different values of $\gamma$. Our work elucidates the trade-off more precisely by showing how the constants in previous works depend on certain hardness parameters. Finally, we show that BoBW-lil'UCB$(\gamma)$ outperforms a close competitor, UCB$_\alpha$ (Degenne et al., 2019), in terms of both time complexity and regret on diverse datasets such as MovieLens and the Published Kinase Inhibitor Set.