We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. Designing regret-efficient randomized exploration algorithms in this setting has proved challenging. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon length.
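To make the closed-form action probabilities concrete, the following is a sketch of the natural KL analogue of the Maillard sampling rule, consistent with the description above; the notation ($N_{t-1,a}$, $\hat{\mu}_{t-1,a}$) is introduced here for illustration rather than quoted from the paper. At each round $t$, such a rule would select arm $a$ with probability
\[
p_{t,a} \;\propto\; \exp\!\big(-N_{t-1,a}\,\mathrm{kl}(\hat{\mu}_{t-1,a}, \hat{\mu}_{t-1,\max})\big),
\qquad
\mathrm{kl}(p,q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q},
\]
where $N_{t-1,a}$ and $\hat{\mu}_{t-1,a}$ denote the pull count and empirical mean reward of arm $a$ after $t-1$ rounds, $\hat{\mu}_{t-1,\max} = \max_{a'} \hat{\mu}_{t-1,a'}$, and $\mathrm{kl}$ is the Bernoulli KL divergence. The empirically best arm receives weight $1$, and every other arm is discounted exponentially in how statistically distinguishable it is from the leader; this mirrors Maillard sampling, which uses the weight $\exp(-N_{t-1,a}\hat{\Delta}_{t-1,a}^2/2)$ with $\hat{\Delta}_{t-1,a} = \hat{\mu}_{t-1,\max} - \hat{\mu}_{t-1,a}$ in the sub-Gaussian setting.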