We investigate the problem of bandits with expert advice when the experts are fixed and known distributions over the actions. Improving on previous analyses, we show that the regret in this setting is controlled by information-theoretic quantities that measure the similarity between experts. In some natural special cases, this yields the first regret bound for EXP4 that can get arbitrarily close to zero if the experts are similar enough. For a different algorithm, we provide another bound that describes the similarity between the experts in terms of the KL-divergence, and we show that this bound can be smaller than that of EXP4 in some cases. Additionally, we provide lower bounds for certain classes of experts, showing that the algorithms we analyze are nearly optimal in some cases.
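To make the setting concrete, the following is a minimal sketch of the textbook EXP4 update when every expert is a fixed, known distribution over the actions. It is an illustration only, not the tuned algorithm or the analysis of this work: the function name `exp4_fixed_experts`, the callback `loss_fn`, and the learning rate `eta` are hypothetical names introduced here, and losses are assumed to lie in [0, 1].

```python
import numpy as np


def exp4_fixed_experts(experts, loss_fn, horizon, eta, seed=0):
    """Run a basic EXP4 loop with fixed expert distributions (illustrative sketch).

    experts : array of shape (n_experts, n_actions); each row is a known,
              time-invariant probability distribution over actions.
    loss_fn : hypothetical callback (round, action) -> loss in [0, 1].
    eta     : learning rate (assumed to be supplied by the caller).
    Returns the final distribution over experts and the cumulative loss.
    """
    rng = np.random.default_rng(seed)
    experts = np.asarray(experts, dtype=float)
    n_experts, n_actions = experts.shape
    log_w = np.zeros(n_experts)  # log-weights, kept in log space for stability
    total_loss = 0.0

    for t in range(horizon):
        # Distribution over experts from the current log-weights.
        q = np.exp(log_w - log_w.max())
        q /= q.sum()

        # Mixture of the expert distributions gives the action distribution.
        p = q @ experts
        p /= p.sum()  # guard against floating-point drift

        action = rng.choice(n_actions, p=p)
        loss = float(loss_fn(t, action))
        total_loss += loss

        # Importance-weighted loss estimate for every expert:
        # expert i is charged experts[i, action] * loss / p[action].
        expert_loss_est = experts[:, action] * loss / p[action]
        log_w -= eta * expert_loss_est

    q = np.exp(log_w - log_w.max())
    return q / q.sum(), total_loss
```

For example, with two experts `[[0.9, 0.1], [0.1, 0.9]]` and a loss of 1 on the second action only, the weight of the first expert can only grow over time, since it is charged less whenever a loss is observed. When the rows of `experts` are nearly identical, the per-expert loss estimates are nearly identical as well, which is the intuition behind similarity-dependent regret bounds.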