How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues that are particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into an arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data that a full warm-start approach requires, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. The experiments show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.
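
The abstract does not spell out the replay mechanism, so the following is only a minimal sketch under an assumed reading: whenever the base algorithm proposes an action that still has unused historical samples, one such sample is fed to the base algorithm in place of an online pull, and the base algorithm is asked again; only when the proposed action has no unused history is it played online. The `UCB1` class, the `artificial_replay` function, the `select`/`update` interface, and the toy two-armed instance are all hypothetical names and choices made for illustration, not the authors' implementation.

```python
import math
import random
from collections import defaultdict, deque


class UCB1:
    """Illustrative K-armed base algorithm (standard UCB1 index)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        # Try every arm once before using the confidence-bound index.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)

        def index(arm):
            mean = self.sums[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=index)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def artificial_replay(base, history, pull, horizon):
    """Assumed replay loop: consume history lazily, only for proposed arms."""
    unused = defaultdict(deque)
    for arm, reward in history:
        unused[arm].append(reward)

    used = 0      # number of historical samples actually consumed
    online = []   # (arm, reward) pairs observed from real pulls
    for _ in range(horizon):
        arm = base.select()
        # Replay history for the proposed arm instead of pulling it online.
        while unused[arm]:
            base.update(arm, unused[arm].popleft())
            used += 1
            arm = base.select()
        reward = pull(arm)
        base.update(arm, reward)
        online.append((arm, reward))
    return online, used


# Toy instance (assumed): arm 0 is poor but heavily covered by history.
rng = random.Random(0)
means = [0.2, 0.8]
history = [(0, 0.0)] * 50 + [(1, 1.0)] * 2


def pull(arm):
    return 1.0 if rng.random() < means[arm] else 0.0


base = UCB1(n_arms=2)
online, used = artificial_replay(base, history, pull, horizon=200)
print(f"historical samples used: {used} of {len(history)}")
```

In this sketch, a full warm start would correspond to calling `base.update` on every entry of `history` before the first online round; the replay loop instead touches a historical sample only when the base algorithm actually asks for that arm, which is one way to read the claim that only a fraction of the data is needed.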