PAC-Bayes has recently re-emerged as an effective theory with which one can derive principled learning algorithms with tight performance guarantees. However, applications of PAC-Bayes to bandit problems are relatively rare, which is a great misfortune. Many decision-making problems in healthcare, finance and the natural sciences can be modelled as bandit problems. In many of these applications, principled algorithms with strong performance guarantees would be very much appreciated. This survey provides an overview of PAC-Bayes bounds for bandit problems and an experimental comparison of these bounds. On the one hand, we found that PAC-Bayes bounds are a useful tool for designing offline bandit algorithms with performance guarantees. In our experiments, a PAC-Bayesian offline contextual bandit algorithm was able to learn randomised neural network policies with competitive expected reward and non-vacuous performance guarantees. On the other hand, the PAC-Bayesian online bandit algorithms that we tested had loose cumulative regret bounds. We conclude by discussing some topics for future work on PAC-Bayesian bandit algorithms.