PAC-Bayesian analysis is a frequentist framework for incorporating prior knowledge into learning. It was inspired by Bayesian learning, which allows sequential data processing and naturally turns posteriors from one processing step into priors for the next. However, despite two and a half decades of research, the ability to update priors sequentially without losing confidence information along the way remained elusive for PAC-Bayes. While PAC-Bayes allows construction of data-informed priors, the final confidence intervals depend only on the number of points that were not used for the construction of the prior, whereas confidence information in the prior, which is related to the number of points used to construct the prior, is lost. This limits the possibility and benefit of sequential prior updates, because the final bounds depend only on the size of the final batch. We present a novel and, in retrospect, surprisingly simple and powerful PAC-Bayesian procedure that allows sequential prior updates with no information loss. The procedure is based on a novel decomposition of the expected loss of randomized classifiers. The decomposition rewrites the loss of the posterior as an excess loss relative to a downscaled loss of the prior plus the downscaled loss of the prior, which is bounded recursively. As a side result, we also present a generalization of the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables, which we use for bounding the excess losses, and which can be of independent interest. In empirical evaluation, the new procedure significantly outperforms the state of the art.
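The decomposition described above can be sketched as a single identity. The notation is an assumption made here for illustration and does not appear in the abstract: $\pi_{t-1}$ stands for the prior at step $t$ (the posterior of the previous step), $\pi_t$ for the current posterior, $L(h)$ for the expected loss of a classifier $h$, and $\gamma_t$ for the downscaling factor, assumed to lie in $[0, 1]$.

```latex
% Sketch under assumed notation: \pi_t, \pi_{t-1}, \gamma_t and L are illustrative symbols.
\[
  \mathbb{E}_{\pi_t}[L(h)]
  = \underbrace{\mathbb{E}_{\pi_t}[L(h)] - \gamma_t \, \mathbb{E}_{\pi_{t-1}}[L(h)]}_{\text{excess loss relative to the downscaled prior loss}}
  + \underbrace{\gamma_t \, \mathbb{E}_{\pi_{t-1}}[L(h)]}_{\text{downscaled prior loss, bounded recursively}}
\]
```

The identity holds for any value of $\gamma_t$, since the subtracted term is added back. Its use, as far as the abstract indicates, is that the second term has the same form as the left-hand side one step earlier, so the same argument can be applied to it again.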