We investigate the extent to which offline demonstration data can improve online learning. Some improvement is natural to expect, but the questions are how it arises and how large it is. We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data are generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that uses the demonstration data coherently through Bayes' rule, and we derive a prior-dependent Bayesian regret bound. The bound offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm based on Bayesian bootstrapping and show substantial empirical regret reduction in experiments.
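To make the setting concrete, the following is a minimal Python sketch of Thompson sampling on a Bernoulli bandit whose Beta posterior is warm-started from offline demonstrations. It is an illustration under stated assumptions, not the informed TS algorithm of the paper: the softmax expert model, the `competence` parameter, and all function names are hypothetical, and the sketch conditions only on the demonstrated rewards, ignoring whatever information the expert's choice of arms carries.

```python
import numpy as np


def make_demonstrations(true_means, competence, n, rng):
    """Hypothetical expert: picks arms with probability softmax(competence * means).

    competence = 0 gives uniformly random choices; larger values concentrate
    on the best arm. This expert model is an assumption made for illustration.
    """
    true_means = np.asarray(true_means, dtype=float)
    logits = competence * true_means
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    arms = rng.choice(len(true_means), size=n, p=probs)
    rewards = rng.binomial(1, true_means[arms])
    return arms, rewards


def warm_start_posterior(n_arms, arms, rewards):
    """Beta(1, 1) prior per arm, updated with the offline (arm, reward) pairs."""
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    np.add.at(alpha, arms, rewards)      # count successes per arm
    np.add.at(beta, arms, 1 - rewards)   # count failures per arm
    return alpha, beta


def thompson_sampling(true_means, alpha, beta, horizon, rng):
    """Run Bernoulli Thompson sampling and return cumulative (pseudo-)regret."""
    true_means = np.asarray(true_means, dtype=float)
    alpha, beta = alpha.copy(), beta.copy()
    best = true_means.max()
    regret = 0.0
    for _ in range(horizon):
        # Sample a mean for each arm from its posterior and play the argmax.
        arm = int(np.argmax(rng.beta(alpha, beta)))
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret
```

One way to use the sketch is to call `thompson_sampling` once with the uniform prior (`np.ones(k)`, `np.ones(k)`) and once with the output of `warm_start_posterior`, for several values of `competence`, and compare the returned regret averaged over many random seeds.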