Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research focuses narrowly on samples from classes that are absent from the training set, neglecting other plausible types of distribution shift. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shift and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel at detecting unknown classes, their performance is inconsistent on other types of distribution shift. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we fit a Gaussian mixture as a generative model of existing detection scores. The resulting ensemble approach offers a more consistent and comprehensive solution for broad OOD detection and outperforms existing methods. Our code to download BROAD and reproduce our experiments is publicly available.
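The idea of modelling existing detection scores with a Gaussian mixture can be illustrated with a minimal sketch. This is not the released BROAD code. It assumes that each input is described by a vector of scores from several existing detectors, that the mixture is fit on in-distribution score vectors only, and that the negative log-likelihood under the mixture serves as the combined OOD score. The function names (`fit_gmm`, `ensemble_ood_score`), the diagonal covariance, and the synthetic two-detector data are all hypothetical choices made for illustration.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_gmm(data, k, n_iter=50, seed=0, min_var=1e-6):
    """Fit a k-component diagonal GMM with plain EM (illustrative, not optimized)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Initialize means from random data points and variances from the global spread.
    means = [list(p) for p in rng.sample(data, k)]
    g_mean = [sum(p[d] for p in data) / n for d in range(dim)]
    g_var = [max(sum((p[d] - g_mean[d]) ** 2 for p in data) / n, min_var)
             for d in range(dim)]
    variances = [list(g_var) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score vector.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                    for j in range(k)]
            norm = logsumexp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                        for d in range(dim)]
            variances[j] = [
                max(sum(r[j] * (x[d] - means[j][d]) ** 2
                        for r, x in zip(resp, data)) / nj, min_var)
                for d in range(dim)
            ]
    return weights, means, variances


def ensemble_ood_score(x, model):
    """Negative log-likelihood under the mixture: higher means more anomalous."""
    weights, means, variances = model
    return -logsumexp([math.log(w) + log_gauss_diag(x, m, v)
                       for w, m, v in zip(weights, means, variances)])


# Synthetic in-distribution scores from two hypothetical detectors.
rng = random.Random(0)
in_dist = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(100)]
in_dist += [[rng.gauss(4.0, 0.5), rng.gauss(4.0, 0.5)] for _ in range(100)]

model = fit_gmm(in_dist, k=2)
typical = ensemble_ood_score([0.0, 0.0], model)     # resembles training scores
shifted = ensemble_ood_score([10.0, -10.0], model)  # unlike any training scores
print(f"typical: {typical:.2f}, shifted: {shifted:.2f}")
```

A density model over the joint score vector can flag an input whose combination of scores is unusual even when no single detector's score crosses its own threshold, which is one plausible reading of why such an ensemble behaves more consistently across shift types. In practice, scores from different detectors would likely need to be normalized to comparable scales before fitting.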