We study best-arm identification in bandits with arbitrary and potentially adversarial rewards. A simple learner that samples arms uniformly at random obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. We therefore ask: can we design a learner that performs optimally in both the stochastic and adversarial problems without knowing the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, a learner can guarantee optimal rates of error only on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems when the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches the lower bound (up to log factors) in stochastic problems, and that it is also robust to adversarial ones.
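To make the uniform baseline concrete, the following is a minimal sketch of a learner that samples arms uniformly at random and then recommends the arm with the largest observed cumulative reward. It is an illustration under stated assumptions, not the algorithm proposed in this work: the function name `uniform_best_arm`, the `reward_fn(t, arm)` interface, and the fixed `budget` are all hypothetical choices made for the example.

```python
import random


def uniform_best_arm(reward_fn, n_arms, budget, seed=0):
    """Sketch of a uniform-sampling learner for best-arm identification.

    reward_fn(t, arm) is a hypothetical interface returning the reward of
    `arm` at round `t`; it may be stochastic or chosen by an adversary.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_arms
    for t in range(budget):
        # Every arm is drawn with the same probability 1 / n_arms,
        # regardless of the rewards observed so far.
        arm = rng.randrange(n_arms)
        totals[arm] += reward_fn(t, arm)
    # Because all arms share the same sampling probability, comparing raw
    # sums is equivalent to comparing importance-weighted estimates of the
    # cumulative rewards.
    return max(range(n_arms), key=lambda a: totals[a])


if __name__ == "__main__":
    # Example with stochastic Bernoulli rewards (assumed means).
    means = [0.3, 0.5, 0.7]
    noise = random.Random(1)
    best = uniform_best_arm(
        lambda t, a: 1.0 if noise.random() < means[a] else 0.0,
        n_arms=len(means),
        budget=3000,
    )
    print("recommended arm:", best)
```

Because the sampling distribution never depends on past rewards, an adversary cannot exploit the learner's choices. The same property explains why the strategy is wasteful in stochastic problems, where an adaptive learner could stop pulling clearly inferior arms.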