The multi-armed bandit (MAB) problem is a widely studied model for sequential decision making in operations research and reinforcement learning. This paper considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [21]. These earlier policies require either a known upper bound on specific moments of the reward distributions or the existence of a particular moment, conditions that can be hard to verify or guarantee in practice. Our extended robust UCB generalizes Lattimore's seminal work (which uses moments of orders $p=4$ and $q=2$) to arbitrarily chosen orders $p$ and $q$, provided the two moments satisfy a known, controlled relationship, while still achieving the optimal regret growth order $O(\log T)$. This broadens the range of heavy-tailed reward distributions to which UCB policies apply.
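To illustrate the family of policies the abstract refers to, the following is a minimal sketch of a robust UCB with truncated empirical means in the spirit of Bubeck et al. [5], not the paper's extended policy itself. It assumes a known bound `u` on the $(1+\varepsilon)$-th raw moment of each arm's reward distribution; the function name, the choice $\delta_t = t^{-2}$, and the sampler interface are illustrative assumptions, not from the source.

```python
import math
import random

def truncated_mean_ucb(arm_samplers, horizon, u=1.0, eps=1.0, seed=0):
    """Sketch of a robust UCB policy with truncated empirical means.

    Assumes E[|X|^(1+eps)] <= u for every arm (heavy tails allowed:
    only this moment is needed, not sub-Gaussianity).
    arm_samplers: list of callables rng -> reward, one per arm.
    Returns the pull counts per arm after `horizon` rounds.
    """
    rng = random.Random(seed)
    K = len(arm_samplers)
    rewards = [[] for _ in range(K)]  # observed samples per arm
    pulls = [0] * K

    def index(i, t):
        s = pulls[i]
        if s == 0:
            return float("inf")  # force one pull of every arm first
        log_term = math.log(t * t)  # log(1/delta_t) with delta_t = t^{-2}
        # Truncated mean: the k-th sample is kept only if it lies below
        # a level that grows with k, which tames heavy-tailed outliers.
        trunc = sum(
            x for k, x in enumerate(rewards[i], start=1)
            if abs(x) <= (u * k / log_term) ** (1.0 / (1.0 + eps))
        ) / s
        # Confidence bonus matching the truncated-mean concentration bound.
        bonus = 4.0 * u ** (1.0 / (1.0 + eps)) * (log_term / s) ** (eps / (1.0 + eps))
        return trunc + bonus

    for t in range(1, horizon + 1):
        i = max(range(K), key=lambda j: index(j, t))
        rewards[i].append(arm_samplers[i](rng))
        pulls[i] += 1
    return pulls
```

Run on two arms with means 0.2 and 0.8, the policy concentrates its pulls on the better arm while the truncation keeps occasional extreme rewards from distorting the estimates; the per-arm regret contribution grows only logarithmically in the horizon, matching the $O(\log T)$ order discussed above.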