In the multi-armed bandit framework, two formulations are commonly used to handle time-varying reward distributions: the adversarial bandit and the nonstationary bandit. Although their oracles, algorithms, and regret analyses differ significantly, this paper provides a unified formulation that smoothly bridges the two, recovering each as a special case. The formulation uses an oracle that selects the best fixed arm within each time window. Depending on the window size, this oracle reduces to the oracle in hindsight of the adversarial bandit or to the dynamic oracle of the nonstationary bandit. We provide algorithms that attain the optimal regret, together with a matching lower bound.
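To make the window-based oracle concrete, the following is a minimal sketch rather than the paper's definition. It assumes the horizon is split into consecutive, non-overlapping windows of equal length (the last one may be shorter) and that the full reward table is available for evaluation. The function name `windowed_oracle_reward` and the `(T, K)` array layout are illustrative choices, not taken from the paper.

```python
import numpy as np


def windowed_oracle_reward(rewards: np.ndarray, window: int) -> float:
    """Total reward of an oracle that plays the best fixed arm in each window.

    rewards: array of shape (T, K); rewards[t, k] is the reward of arm k at round t.
    window:  window length (assumed: consecutive, non-overlapping windows).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    horizon = rewards.shape[0]
    total = 0.0
    for start in range(0, horizon, window):
        block = rewards[start:start + window]  # rounds in this window
        total += block.sum(axis=0).max()       # best single arm for the window
    return float(total)


# Toy instance: arm 0 is good in the first half, arm 1 in the second half.
rewards = np.array([[1.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0]])

hindsight = windowed_oracle_reward(rewards, window=4)  # one window: best fixed arm overall
dynamic = windowed_oracle_reward(rewards, window=1)    # one round per window: best arm each round
```

Under these assumptions, a window equal to the horizon gives the best fixed arm in hindsight (the adversarial benchmark), a window of one round gives the per-round best arm (the dynamic benchmark), and intermediate window sizes interpolate between the two.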