We study a stochastic multi-armed bandit problem in which the set of available arms expands over time. This setting arises in sequential experimentation when new actions or treatments become available during an ongoing study, which makes regret against a single best arm in hindsight an inappropriate benchmark. We instead evaluate performance relative to the best arm available at each round, leading to a dynamic-regret criterion for arriving-arm environments. To address the resulting challenges of arrival information discrepancy (AID) and a drifting benchmark (DB), we propose UCB for Arriving Arms (UCB-AA), an elimination-based procedure that screens newly arrived arms in a preliminary step before they compete fully with incumbent arms. We show that UCB-AA attains regret bounds that depend explicitly on the arrival process, achieves sublinear dynamic regret under regularity conditions on gap evolution, and admits an online extension for unknown horizons. Simulation results show that UCB-AA reduces wasted pulls and maintains a smaller active arm set while preserving competitive regret performance.
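The dynamic-regret criterion described above can be illustrated with a minimal sketch. It is not UCB-AA itself; it only shows how regret might be scored against the best arm available at each round. The function name, the argument names, the 0-indexed rounds, and the use of known mean rewards (so that expected regret can be computed directly) are all assumptions made for illustration.

```python
from typing import Sequence


def dynamic_regret(
    means: Sequence[float],
    arrival_times: Sequence[int],
    pulls: Sequence[int],
) -> float:
    """Cumulative expected regret against the best arm available at each round.

    means[i]         -- mean reward of arm i (assumed known, for illustration)
    arrival_times[i] -- first 0-indexed round at which arm i can be pulled
    pulls[t]         -- index of the arm chosen at round t
    """
    total = 0.0
    for t, arm in enumerate(pulls):
        # Arms that have arrived by round t form the current arm set.
        available = [i for i, a in enumerate(arrival_times) if a <= t]
        if arm not in available:
            raise ValueError(f"arm {arm} pulled at round {t} before it arrived")
        # The benchmark drifts: it is the best mean among arrived arms only.
        best = max(means[i] for i in available)
        total += best - means[arm]
    return total
```

For example, with `means = [0.5, 0.8]`, `arrival_times = [0, 2]`, and `pulls = [0, 0, 0, 1]`, only round 2 incurs regret (about 0.3), because arm 1 did not exist in rounds 0 and 1. Scoring the same pulls against the single best arm in hindsight would charge about 0.9, penalizing the learner for rounds in which the better arm was not yet available.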