Motivated by the recency effect in online learning, we study single-pass algorithms for *sliding-window streaming multi-armed bandits (MABs)*. In this setting, we are given $n$ arms with unknown sub-Gaussian reward distributions and a window parameter $W$. The arms arrive in a single-pass stream, and only the most recent $W$ arms are considered valid. The algorithm must perform pure exploration and regret minimization with limited memory, measured as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window), which has been studied extensively in recent years. We provide a comprehensive analysis of both the pure exploration and the regret minimization problems in this model. For pure exploration, we prove that finding the best arm is hard with sublinear memory, whereas finding an approximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithm. We complement our theoretical results with experiments that demonstrate the trade-offs among sample complexity, regret, and memory.