Accessing machine learning models through remote APIs has become increasingly prevalent following the recent trend of scaling up model parameters for better performance. Even though these models exhibit remarkable capabilities, detecting out-of-distribution (OOD) samples remains a crucial safety concern for end users, as such samples may induce unreliable outputs from the model. In this work, we propose an OOD detection framework, MixDiff, that is applicable even when the model's parameters or activations are inaccessible to the end user. To bypass this access restriction, MixDiff applies an identical input-level perturbation to a given target sample and to a similar in-distribution (ID) sample, then compares the relative difference in the model's outputs for the two samples. MixDiff is model-agnostic and compatible with existing output-based OOD detection methods. We provide a theoretical analysis illustrating MixDiff's effectiveness in discerning OOD samples that induce overconfident model outputs, and we empirically demonstrate that MixDiff consistently enhances OOD detection performance on various datasets in the vision and text domains.
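To make the output-only setting concrete, below is a minimal sketch of the perturb-and-compare idea described above. It assumes mixup-style interpolation as the input-level perturbation and maximum softmax probability (MSP) as the output-based score; the names `query_model`, `msp_score`, and `mixdiff_score`, as well as the toy linear model, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(probs):
    """Maximum softmax probability: a common output-based OOD score
    (higher = more in-distribution-like)."""
    return probs.max(axis=-1)

def mixdiff_score(query_model, target, id_anchor, aux, lam=0.5):
    """Perturb-and-compare sketch in the spirit of MixDiff.

    `query_model` stands in for any black-box prediction endpoint that
    returns class probabilities; only its outputs are used, never its
    parameters or activations.
    """
    # Identical input-level perturbation (here, mixup with the same
    # auxiliary sample) applied to the target and to a similar ID anchor.
    mixed_target = lam * target + (1.0 - lam) * aux
    mixed_anchor = lam * id_anchor + (1.0 - lam) * aux

    s_target = msp_score(query_model(mixed_target[None]))[0]
    s_anchor = msp_score(query_model(mixed_anchor[None]))[0]

    # Relative difference of the perturbed outputs: an overconfident OOD
    # target tends to lose more confidence under the perturbation than
    # the ID anchor does, yielding a larger difference.
    return s_anchor - s_target

# Toy usage with a stand-in linear "model" (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))
query_model = lambda x: softmax(x @ W)

target, id_anchor, aux = rng.normal(size=(3, 16))
print(mixdiff_score(query_model, target, id_anchor, aux, lam=0.5))
```

In practice this score would be combined with an existing output-based detector (e.g., added to the unperturbed target's MSP score), which is what makes the approach compatible with such methods rather than a replacement for them.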