We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities that correspond to infima over Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). We finally provide lower bounds on the same average log-probability, also in terms of the same new information-theoretic quantities; these lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, e.g., on gaps between distributions.