This paper studies the adversarial robustness of conformal novelty detection. In particular, we focus on two powerful learning-based frameworks that come with finite-sample false discovery rate (FDR) control: AdaDetect (Marandon et al., 2024), which is built on a positive-unlabeled classifier, and a one-class classifier-based approach (Bates et al., 2023). While both provide rigorous statistical guarantees under benign conditions, their behavior under adversarial perturbations remains underexplored. We first formulate an oracle attack setup under the AdaDetect framework that quantifies the worst-case degradation of the FDR, and derive an upper bound characterizing the statistical cost of such attacks. This idealized formulation directly motivates a practical and effective attack scheme that requires only query access to the output labels of both frameworks. Coupling this scheme with two popular and complementary black-box adversarial algorithms, we systematically evaluate the vulnerability of both frameworks on synthetic and real-world datasets. Our results show that adversarial perturbations can substantially increase the FDR while maintaining high detection power, exposing fundamental limitations of current error-controlled novelty detection methods and motivating the development of more robust alternatives.
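For reference, the finite-sample guarantee at stake can be stated concretely. The display below is an illustrative sketch using standard notation not fixed in the abstract: a score function $s$ fit without access to test labels, calibration points $X_1,\dots,X_n$ drawn from the null (inlier) distribution, and test points $X_{n+1},\dots,X_{n+m}$. It records the conformal p-value construction of Bates et al. (2023) and the FDR that both frameworks control at level $\alpha$ when such p-values are passed to a Benjamini-Hochberg-type procedure with rejection set $\mathcal{R}$ and true-null index set $\mathcal{H}_0$.

\[
p_j \;=\; \frac{1 + \#\{\, 1 \le i \le n : s(X_i) \ge s(X_{n+j}) \,\}}{n+1},
\qquad
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{|\mathcal{R} \cap \mathcal{H}_0|}{\max(|\mathcal{R}|,\, 1)}\right] \;\le\; \alpha.
\]

The attacks studied in this paper perturb test inputs so that null points receive inflated scores, driving the realized FDR above $\alpha$ without suppressing detections of true novelties.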