This work is motivated by the growing demand for reproducible machine learning. We study the stochastic multi-armed bandit problem and, in particular, replicable algorithms: algorithms that ensure, with high probability, that their sequence of actions is not affected by the randomness inherent in the dataset. Existing replicable algorithms incur regret that is $O(1/\rho^2)$ times larger than that of nonreplicable algorithms, where $\rho$ is the level of nonreplication. We demonstrate that this additional cost is unnecessary when the time horizon $T$ is sufficiently large for a given $\rho$, provided that the magnitude of the confidence bounds is chosen carefully. We introduce an explore-then-commit algorithm that draws arms uniformly before committing to a single arm, and we examine a successive elimination algorithm that eliminates suboptimal arms at the end of each phase. To make these algorithms replicable, we incorporate randomness into their decision-making processes. We also extend successive elimination to the linear bandit problem. To analyze these algorithms, we propose a principled approach to bounding the probability of nonreplication; this approach makes explicit the steps that existing research has followed implicitly. Finally, we derive the first lower bound for the two-armed replicable bandit problem, which implies that the proposed algorithms are optimal up to a $\log\log T$ factor in the two-armed case.
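To illustrate how randomness in the decision rule can make an explore-then-commit procedure replicable, the following is a minimal sketch and not the algorithm proposed in this work. It swaps in a generic randomized-rounding device: each empirical mean is snapped to a grid whose offset is drawn from internal randomness shared across runs, so two runs on different data tend to commit to the same arm. The names `replicable_etc`, `bernoulli_env`, `shared_seed`, and `width` are hypothetical, and the grid spacing is an arbitrary choice rather than the carefully chosen confidence-bound magnitude mentioned above.

```python
import math
import random


def bernoulli_env(means, data_seed):
    """Hypothetical test environment: arm i pays 1 with probability means[i]."""
    rng = random.Random(data_seed)  # data randomness, differs between runs
    return lambda arm: 1.0 if rng.random() < means[arm] else 0.0


def replicable_etc(pull, n_arms, horizon, explore_per_arm, shared_seed, width):
    """Explore-then-commit with a randomized-rounding commit rule (sketch).

    Assumes horizon >= n_arms * explore_per_arm.
    """
    shared = random.Random(shared_seed)  # internal randomness, reused across runs
    offset = shared.uniform(0.0, width)  # random grid offset (assumed device)
    actions, sums = [], [0.0] * n_arms

    # Exploration: round-robin draws, so this part never depends on the data.
    for t in range(n_arms * explore_per_arm):
        arm = t % n_arms
        sums[arm] += pull(arm)
        actions.append(arm)

    # Snap each empirical mean to a randomly shifted grid of spacing `width`.
    cells = [math.floor((s / explore_per_arm - offset) / width) for s in sums]
    # Commit to the arm in the highest cell; ties go to the lowest arm index.
    best = max(range(n_arms), key=lambda a: (cells[a], -a))

    # Commit phase: play the chosen arm for the remaining rounds.
    for _ in range(horizon - len(actions)):
        pull(best)
        actions.append(best)
    return actions


# Two runs: same internal randomness, independent datasets.
run_a = replicable_etc(bernoulli_env([0.9, 0.1], 1), 2, 1000, 200, 7, 0.2)
run_b = replicable_etc(bernoulli_env([0.9, 0.1], 2), 2, 1000, 200, 7, 0.2)
print(run_a == run_b)
```

In this sketch, the two runs can disagree only if some arm's empirical means fall on opposite sides of a grid boundary. For a single arm this happens with probability at most the difference between its two empirical means divided by `width`, so a coarser grid makes disagreement less likely but resolves small gaps between arms less finely.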