We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably, these optimality conditions lead to an extension of the top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule, dubbed information-directed selection (IDS), that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that, when integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for k > 1, top-two algorithms cannot achieve optimality even when the algorithm has access to the unknown "optimal" tuning parameter. Numerical experiments show that the proposed top-two algorithms with IDS perform strongly and improve considerably on algorithms without adaptive selection.
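To make the selection step concrete, the following is a minimal sketch of top-two Thompson sampling with an IDS-style coin flip for the single-best-arm (k = 1) Gaussian case. Everything beyond what is stated above is an assumption made for illustration: the function name `ttts_ids`, the flat prior with known common variance `sigma`, the fixed `horizon`, the resampling cap `max_resample`, and in particular the selection probability, which is assumed here to be the challenger's share of the two candidates' combined pull counts. It should not be read as the authors' implementation.

```python
import numpy as np

def ttts_ids(true_means, sigma=1.0, horizon=2000, seed=0, max_resample=1000):
    """Hypothetical sketch of top-two Thompson sampling with an IDS-style
    selection step, for Gaussian rewards with known variance and k = 1."""
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    n_arms = len(true_means)

    # Pull every arm once so that each posterior is well defined.
    counts = np.ones(n_arms)
    sums = rng.normal(true_means, sigma)

    for _step in range(horizon - n_arms):
        # Assumed model: flat prior, so the posterior of arm i is
        # N(sample mean, sigma^2 / number of pulls).
        post_mean = sums / counts
        post_std = sigma / np.sqrt(counts)

        # Leader: the arm that looks best under one posterior sample.
        leader = int(np.argmax(rng.normal(post_mean, post_std)))

        # Challenger: resample until a different arm looks best.
        challenger = leader
        for _try in range(max_resample):
            challenger = int(np.argmax(rng.normal(post_mean, post_std)))
            if challenger != leader:
                break
        else:
            # Fallback (an assumption): best other arm by posterior mean.
            others = np.delete(np.arange(n_arms), leader)
            challenger = int(others[np.argmax(post_mean[others])])

        # Assumed IDS rule for equal-variance Gaussians: measure the leader
        # with probability N_challenger / (N_leader + N_challenger).
        p_leader = counts[challenger] / (counts[leader] + counts[challenger])
        arm = leader if rng.random() < p_leader else challenger

        # Observe a reward and update the sufficient statistics.
        counts[arm] += 1
        sums[arm] += rng.normal(true_means[arm], sigma)

    return counts, sums / counts

if __name__ == "__main__":
    counts, estimates = ttts_ids([0.0, 0.3, 1.0])
    print("pulls per arm:", counts)
    print("estimated best arm:", int(np.argmax(estimates)))
```

The point of the sketch is the contrast with a fixed tuning parameter: instead of measuring the leader with a constant probability, the coin is reweighted every round from the current allocation between the two candidates, so whichever of the two has been sampled less is favored.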