This paper presents a flexible, nonparametric methodology for full-data novelty detection with distribution-free false discovery rate (FDR) control guarantees. Building on the full conformal inference framework and the concept of e-values, we introduce full conformal e-values, which quantify the evidence that a unit is novel relative to a given reference dataset. These e-values are then passed to carefully crafted multiple testing procedures that identify a set of novel out-of-sample units with provable finite-sample FDR control. We present several instantiations of these e-values, including ones that use a data-driven model selection strategy to increase power. We further extend the framework to handle distribution shift, where novelty detection must be performed on data drawn from a distribution that is shifted relative to that of the reference dataset. In all settings, our method achieves high power, outperforming existing novelty detection methods, even when reference data are limited; we illustrate this with empirical evaluations on synthetic data and an application to a dataset of malicious LLM prompts.
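To make concrete how e-values can feed a multiple testing procedure with FDR control, the sketch below implements the standard e-BH (e-value Benjamini-Hochberg) rule. This is an illustration only: the abstract does not specify which procedures the paper uses, so e-BH is an assumption here, and the function name `e_bh` and the example e-values are hypothetical. Given m e-values and a target level alpha, e-BH rejects the k* hypotheses with the largest e-values, where k* is the largest k such that the k-th largest e-value is at least m / (alpha * k).

```python
from typing import List, Sequence


def e_bh(e_values: Sequence[float], alpha: float) -> List[int]:
    """Generic e-BH rule (an assumed stand-in, not the paper's own procedure).

    Returns the sorted indices of the units flagged as novel.
    """
    m = len(e_values)
    if m == 0:
        return []

    # Indices ordered from the largest e-value to the smallest.
    order = sorted(range(m), key=lambda i: e_values[i], reverse=True)

    # k* = largest k whose k-th largest e-value clears m / (alpha * k).
    k_star = 0
    for k, idx in enumerate(order, start=1):
        if e_values[idx] >= m / (alpha * k):
            k_star = k

    # Reject the k* hypotheses with the largest e-values.
    return sorted(order[:k_star])


# Hypothetical e-values for four test units, target FDR level 0.1.
flagged = e_bh([50.0, 1.0, 25.0, 0.5], alpha=0.1)
print(flagged)
```

With these made-up inputs, the thresholds m / (alpha * k) are 40 for k = 1 and 20 for k = 2, so the units with e-values 50 and 25 are flagged and the other two are not.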