As machine learning is deployed more widely, ensuring its safety becomes increasingly important. A key concern is determining whether a given sample comes from the training distribution or is an "Out-Of-Distribution" (OOD) sample. In addition, adversaries can manipulate OOD samples so that a classifier makes a confident prediction on them. In this study, we present a novel approach for certifying the robustness of OOD detection within an $\ell_2$-norm ball around the input, regardless of network architecture and without the need for specific components or additional training. Further, we improve current techniques for detecting adversarial attacks on OOD samples, while providing high levels of certified and adversarial robustness on in-distribution samples. The average of all OOD detection metrics on CIFAR10/100 increases by $\sim 13\% / 5\%$, respectively, relative to previous approaches.