In this paper we propose a general methodology for deriving regret bounds for randomized multi-armed bandit algorithms. It consists of checking a set of sufficient conditions on the sampling probability of each arm and on the family of distributions in order to prove logarithmic regret. As a direct application, we revisit two famous bandit algorithms, Minimum Empirical Divergence (MED) and Thompson Sampling (TS), under various models for the distributions, including single-parameter exponential families, Gaussian distributions, bounded distributions, and distributions satisfying some conditions on their moments. In particular, we prove that MED is asymptotically optimal for all these models, and we also provide a simple regret analysis of some TS algorithms for which optimality is already known. We then further illustrate the value of our approach by analyzing a new Non-Parametric TS algorithm (h-NPTS), adapted to some families of unbounded reward distributions with a bounded h-moment. This model can, for instance, capture some non-parametric families of distributions whose variance is upper bounded by a known constant.
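To make the notion of a per-arm sampling probability concrete, the following is a minimal sketch of an MED-style randomized policy for Bernoulli rewards. It is an illustration under stated assumptions, not the algorithm as specified in the paper: we assume each arm is drawn with probability proportional to exp(-N_k · KL(mean_k, best_mean)), where N_k is the number of pulls of arm k and KL is the Bernoulli Kullback-Leibler divergence. The function names `bernoulli_kl`, `med_sampling_probabilities`, and `run_med` are hypothetical.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def med_sampling_probabilities(counts, means):
    """Assumed MED-style rule: weight_k = exp(-N_k * KL(mean_k, best_mean)), then normalize."""
    best = max(means)
    weights = [math.exp(-n * bernoulli_kl(m, best)) for n, m in zip(counts, means)]
    total = sum(weights)  # at least 1, since the empirical best arm has weight exp(0) = 1
    return [w / total for w in weights]


def run_med(true_means, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms and return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k

    def pull(arm):
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    # Initialization: pull every arm once so that empirical means are defined.
    for arm in range(k):
        pull(arm)

    # Main loop: sample an arm from the randomized policy at each round.
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        probs = med_sampling_probabilities(counts, means)
        arm = rng.choices(range(k), weights=probs)[0]
        pull(arm)
    return counts


if __name__ == "__main__":
    print(run_med([0.2, 0.5, 0.8], horizon=1000))
```

In this sketch the empirically best arm always receives the largest weight, while an arm that looks suboptimal keeps a probability that decays exponentially in its number of pulls times its empirical divergence from the best mean. Sampling probabilities of this kind are the quantities on which sufficient conditions would be checked.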