Stochastic multi-armed bandits (MABs) provide a fundamental reinforcement learning model for studying sequential decision making in uncertain environments. The upper confidence bound (UCB) algorithm sparked the renaissance of bandit algorithms, as it achieves near-optimal regret rates under various moment assumptions. Until recently, most UCB methods relied on concentration inequalities that lead to confidence bounds depending on moment parameters, such as the variance proxy, which are usually unknown in practice. In this paper, we propose a new distribution-free, data-driven UCB algorithm for symmetric reward distributions, which needs no moment information. The key idea is to combine a refined, one-sided version of the recently developed resampled median-of-means (RMM) method with UCB. We prove a near-optimal regret bound for the proposed anytime, parameter-free RMM-UCB method, even for heavy-tailed distributions.
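For readers unfamiliar with the median-of-means idea underlying RMM, the sketch below is a minimal, hypothetical Python illustration of a median-of-means estimate combined with a UCB-style index. The block count `k`, the confidence-radius shape, and the `delta_scale` parameter are assumptions chosen for illustration only; the paper's RMM-UCB instead calibrates a one-sided, data-driven confidence bound by resampling and does not take this closed form.

```python
import numpy as np

def median_of_means(samples, num_blocks):
    """Median-of-means estimate: split the samples into blocks,
    average each block, and return the median of the block means.
    The median step makes the estimate robust to heavy tails."""
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(samples, num_blocks)
    return float(np.median([b.mean() for b in blocks]))

def mom_ucb_index(samples, t, delta_scale=1.0):
    """Illustrative UCB index built on a median-of-means estimate.

    Uses k = O(log t) blocks, as in standard MoM analyses; the
    radius below is a hypothetical placeholder, since the whole
    point of RMM-UCB is to avoid fixing such a moment-dependent
    constant in advance."""
    n = len(samples)
    k = max(1, min(n, int(np.ceil(8.0 * np.log(t + 1)))))  # number of blocks
    estimate = median_of_means(samples, k)
    radius = delta_scale * np.sqrt(k / n)  # assumed radius shape
    return estimate + radius
```

In a bandit loop, one would pull the arm maximizing `mom_ucb_index` over its observed rewards at round `t`; the paper's contribution is replacing the hand-tuned radius with a one-sided, resampling-based bound that needs no moment parameters.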