Using unlabeled data to regularize machine learning models has shown promise for improving safety and reliability in detecting out-of-distribution (OOD) data. Harnessing unlabeled in-the-wild data is non-trivial, however, due to the heterogeneity of both in-distribution (ID) and OOD data. The lack of a clean set of OOD samples poses significant challenges for learning an optimal OOD classifier. To date, little research has formally examined how unlabeled data helps OOD detection. This paper bridges the gap by introducing a new learning framework, SAL (Separate And Learn), that offers both strong theoretical guarantees and empirical effectiveness. The framework first separates candidate outliers from the unlabeled data and then trains an OOD classifier using the candidate outliers and the labeled ID data. Theoretically, we provide rigorous error bounds through the lens of separability and learnability, formally justifying the two components of our algorithm. Our theory shows that SAL can separate the candidate outliers with small error rates, which leads to a generalization guarantee for the learned OOD classifier. Empirically, SAL achieves state-of-the-art performance on common benchmarks, reinforcing our theoretical insights. Code is publicly available at https://github.com/deeplearning-wisc/sal.
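The two-stage structure described in the abstract (separate candidate outliers, then learn a binary OOD classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the separation rule, so the distance-to-ID-mean score, the 95% quantile threshold, the logistic-regression classifier, the toy Gaussian data, and all function names below are assumptions chosen only to make the pipeline concrete.

```python
import numpy as np


def separate_candidates(id_x, unlabeled_x, quantile=0.95):
    """Stage 1: flag unlabeled points that look unlike the labeled ID data.

    Placeholder rule (an assumption, not the paper's score): the score is the
    distance to the ID mean, and the threshold is a quantile of the ID scores.
    """
    center = id_x.mean(axis=0)
    id_scores = np.linalg.norm(id_x - center, axis=1)
    threshold = np.quantile(id_scores, quantile)
    scores = np.linalg.norm(unlabeled_x - center, axis=1)
    return unlabeled_x[scores > threshold]


def train_ood_classifier(id_x, outlier_x, lr=0.1, steps=500):
    """Stage 2: fit a binary classifier, ID -> 0, candidate outlier -> 1.

    Plain logistic regression trained by full-batch gradient descent.
    """
    x = np.vstack([id_x, outlier_x])
    y = np.concatenate([np.zeros(len(id_x)), np.ones(len(outlier_x))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(OOD)
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def predict_ood(w, b, x):
    """Return True where the classifier predicts OOD."""
    return x @ w + b > 0


# Toy data (hypothetical): ID is a Gaussian at the origin, OOD one at (6, 6).
rng = np.random.default_rng(0)
id_x = rng.normal(0.0, 1.0, size=(500, 2))
unlabeled_x = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),  # unlabeled ID portion
    rng.normal(6.0, 1.0, size=(200, 2)),  # unlabeled OOD portion
])

candidates = separate_candidates(id_x, unlabeled_x)
w, b = train_ood_classifier(id_x, candidates)
print(len(candidates), "candidate outliers; w =", w, "b =", b)
```

Note that the candidate set is not clean: some unlabeled ID points exceed the threshold and are labeled as outliers. This is the kind of separation error that the abstract's bounds are meant to control before the classifier is trained.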