Online learning methods, like the seminal Passive-Aggressive Classifier (PAC), remain highly effective for high-dimensional streaming data, out-of-core processing, and other throughput-sensitive applications. Many such algorithms rely on fast adaptation to individual errors as a key to their convergence. While these algorithms enjoy low theoretical regret, in real-world deployment they can be sensitive to individual outliers that cause the algorithm to over-correct. When such outliers occur at the end of the data stream, the final solution can have unexpectedly low accuracy. We design a weighted reservoir sampling (WRS) approach to obtain a stable ensemble model from the sequence of solutions without requiring additional passes over the data, hold-out sets, or a growing amount of memory. Our key insight is that good solutions tend to be error-free for more iterations than bad solutions, and thus the number of passive rounds provides an estimate of a solution's relative quality. Our reservoir therefore contains $K$ previous intermediate weight vectors with high survival times. We demonstrate our WRS approach on the Passive-Aggressive Classifier (PAC) and First-Order Sparse Online Learning (FSOL), where our method consistently and significantly outperforms the unmodified approach. We show that the risk of the ensemble classifier is bounded with respect to the regret of the underlying online learning method.
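The core mechanism can be illustrated with a standard weighted reservoir sampling step in the style of Efraimidis–Spirakis (A-Res), using a solution's survival time (its count of passive, error-free rounds) as its sampling weight. This is a minimal sketch under that assumption, not the paper's exact procedure; the function name `wrs_update` and the toy stream below are illustrative.

```python
import heapq
import random

def wrs_update(reservoir, item, weight, k):
    """One step of Efraimidis-Spirakis (A-Res) weighted reservoir sampling.

    reservoir: min-heap of (key, item) pairs, at most k entries.
    item:      a candidate snapshot (e.g., an intermediate weight vector).
    weight:    its positive sampling weight (e.g., survival time in
               passive rounds); higher weight -> more likely to be kept.
    k:         reservoir capacity.
    """
    # Each item gets key u^(1/w) for u ~ Uniform(0, 1); keeping the k
    # largest keys yields a sample weighted proportionally to w.
    key = random.random() ** (1.0 / weight)
    if len(reservoir) < k:
        heapq.heappush(reservoir, (key, item))
    elif key > reservoir[0][0]:
        # New key beats the smallest key in the reservoir: swap it in.
        heapq.heapreplace(reservoir, (key, item))
    return reservoir

# Toy usage: stream four snapshots with unequal survival-time weights.
random.seed(0)
res = []
for snapshot, survival_time in [("w1", 1.0), ("w2", 100.0),
                                ("w3", 1.0), ("w4", 50.0)]:
    wrs_update(res, snapshot, survival_time, k=2)
```

In the full method, the ensemble prediction would then be formed from the $K$ weight vectors held in the reservoir; the single-pass, constant-memory nature of this update is what avoids extra data passes or hold-out sets.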