The multi-armed bandit (MAB) is a classical sequential decision problem. Most existing work requires assumptions about the reward distribution (e.g., boundedness), yet practitioners may find it difficult to obtain such distributional information when designing models for their problems, especially in non-stationary MAB settings. This paper aims to design a multi-armed bandit algorithm that can be implemented without any information about the reward distribution while still achieving meaningful regret upper bounds. To this end, we propose a novel algorithm that alternates between a greedy rule and forced exploration. Our method applies to Gaussian, Bernoulli, and other sub-Gaussian distributions, and its implementation requires no additional information. We use a unified analysis for different forced exploration strategies and provide problem-dependent regret upper bounds for the stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.
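To make the idea of alternating between a greedy rule and forced exploration concrete, the following is a minimal sketch for the stationary setting only. It is not the paper's algorithm: the abstract does not specify the exploration schedule, so the rule used here (every arm should have been pulled at least `sqrt(t)` times by round `t`) is an assumption chosen for illustration, and the names `forced_exploration_greedy`, `min_pulls`, and `demo` are hypothetical. Note that the policy uses only observed rewards and pull counts, with no variance, range, or other distributional parameter.

```python
import math
import random


def forced_exploration_greedy(pull, n_arms, horizon, min_pulls=math.sqrt):
    """Alternate between forced exploration and the greedy rule.

    pull(arm) returns a stochastic reward. min_pulls(t) is a hypothetical
    schedule (an assumption, not taken from the paper) giving the pull count
    every arm should have reached by round t.
    """
    counts = [0] * n_arms   # number of pulls per arm
    means = [0.0] * n_arms  # empirical mean reward per arm
    for t in range(1, horizon + 1):
        # The least-pulled arm decides whether exploration is forced.
        laggard = min(range(n_arms), key=counts.__getitem__)
        if counts[laggard] < min_pulls(t):
            arm = laggard  # forced exploration
        else:
            arm = max(range(n_arms), key=means.__getitem__)  # greedy rule
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts, means


def demo(seed=0):
    """Run the sketch on three Bernoulli arms with illustrative means."""
    rng = random.Random(seed)
    probs = [0.2, 0.5, 0.8]
    return forced_exploration_greedy(
        lambda arm: 1.0 if rng.random() < probs[arm] else 0.0,
        n_arms=len(probs),
        horizon=5000,
    )
```

Calling `demo()` returns the per-arm pull counts and empirical means. Swapping the lambda for a Gaussian sampler such as `rng.gauss(mu, 1.0)` requires no change to the policy itself, which is the property the abstract emphasizes. A slower-growing `min_pulls` explores less and a faster-growing one explores more; which schedules yield good regret bounds is what the paper's analysis addresses, and this sketch makes no claim about that.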