We investigate the extent to which offline demonstration data can improve online learning. Some improvement is natural to expect; the question is how it arises and how large it is. We show that the degree of improvement must depend on the quality of the demonstration data. To obtain insights that carry over to other settings, we focus on Thompson sampling (TS) applied to a multi-armed bandit, as a prototypical online learning algorithm and model, respectively. The demonstration data are generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that incorporates the demonstration data coherently through Bayes' rule, and we derive a prior-dependent Bayesian regret bound. The bound offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction in experiments.
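To make the idea of conditioning TS on demonstrations concrete, the following is a minimal sketch and not the algorithm analyzed in the paper. It rests on several assumptions that the abstract does not state: rewards are Bernoulli, the prior over arm means is uniform, the expert picks arms from a softmax over the true means with inverse temperature `beta` standing in for the competence level, and the posterior is approximated with weighted particles rather than Bayesian bootstrapping. All function and parameter names are hypothetical.

```python
import numpy as np


def sample_expert_actions(true_means, beta, n, rng):
    """Draw n expert actions from an ASSUMED softmax expert model."""
    logits = beta * np.asarray(true_means, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), size=n, p=probs)


def expert_log_likelihood(particles, actions, beta):
    """Log-likelihood of the expert's actions for each candidate mean vector."""
    actions = np.asarray(actions, dtype=int)
    logits = beta * particles                                  # shape (M, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    counts = np.bincount(actions, minlength=particles.shape[1])
    return log_probs @ counts                                  # shape (M,)


def normalize(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()


def informed_ts(true_means, demo_actions, beta, horizon, n_particles=2000, seed=0):
    """Particle-based Thompson sampling whose prior is conditioned on demonstrations."""
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    n_arms = len(true_means)
    # Candidate mean vectors drawn from a uniform prior (kept away from 0 and 1).
    particles = rng.uniform(0.01, 0.99, size=(n_particles, n_arms))
    # Bayes' rule: reweight the prior by the likelihood of the expert's actions.
    log_w = expert_log_likelihood(particles, demo_actions, beta)
    best = true_means.max()
    regret = 0.0
    for _ in range(horizon):
        # Thompson step: sample one candidate from the posterior, act greedily on it.
        idx = rng.choice(n_particles, p=normalize(log_w))
        arm = int(np.argmax(particles[idx]))
        reward = rng.random() < true_means[arm]
        # Posterior update with the Bernoulli likelihood of the observed reward.
        p = particles[:, arm]
        log_w = log_w + np.log(p if reward else 1.0 - p)
        regret += best - true_means[arm]
    return regret, normalize(log_w)
```

Under these assumptions, setting `beta = 0` makes the expert's actions uninformative, so the demonstrations leave the prior unchanged and the procedure reduces to ordinary particle TS. Larger `beta` concentrates the initial weights on candidates whose best arm matches the expert's choices, which is one simple way to see why the benefit of demonstrations should grow with competence.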