Deep neural networks (DNNs) are highly susceptible to adversarial examples: subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations induce large, localized violations of layer-wise Lipschitz continuity in a small subset of layers. Building on this, we propose two complementary strategies, Recovery Testing (RT) and Logit-layer Testing (LT), to empirically measure these violations and expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead. Furthermore, our system-level analysis provides a practical method for selecting a detection threshold with a formal lower-bound guarantee on accuracy. The code is available at https://github.com/c0510gy/AFLS-AED.
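To make the core quantity concrete, the following is a minimal, self-contained sketch of measuring layer-wise empirical Lipschitz ratios on a toy random MLP. The network, perturbation, and `threshold` value are all hypothetical illustrations, not the paper's models or calibration procedure; under the A Few Large Shifts Assumption, an adversarial input would produce unusually large ratios at a few layers relative to a benign-data baseline.

```python
import numpy as np

# Toy 3-layer MLP with random weights (illustrative stand-in only,
# not one of the models evaluated in the paper).
rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]
weights = [rng.standard_normal((m, n)) / np.sqrt(m)
           for m, n in zip(sizes, sizes[1:])]

def layer_outputs(x):
    """Return the activation after each layer (ReLU hidden, linear logits)."""
    outs, h = [], x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
        outs.append(h)
    return outs

def layerwise_shift_ratios(x, x_pert):
    """Per-layer empirical Lipschitz ratios ||f_l(x') - f_l(x)|| / ||x' - x||."""
    d_in = np.linalg.norm(x_pert - x)
    return [np.linalg.norm(a - b) / d_in
            for a, b in zip(layer_outputs(x_pert), layer_outputs(x))]

x = rng.standard_normal(8)
x_pert = x + 1e-3 * rng.standard_normal(8)   # stand-in for a perturbed input
ratios = layerwise_shift_ratios(x, x_pert)

# Hypothetical detection rule: flag layers whose ratio exceeds a bound
# calibrated on benign data (the threshold here is arbitrary).
threshold = 10.0
flagged = [i for i, r in enumerate(ratios) if r > threshold]
```

A real deployment would record these per-layer statistics with forward hooks on the target model and calibrate the threshold on held-out benign samples, as the abstract's benign-only calibration suggests.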