Safety-Efficacy Trade Off: Robustness against Data-Poisoning

Backdoor and data-poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input-gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.
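
To make the central objects concrete, the following is a minimal sketch of how one might compute the input Hessian of a kernel ridge regression predictor at a cluster of dirty-label poisons. It is an illustration only, not the authors' experimental setup: the Gaussian (RBF) kernel, the synthetic data, the helper names (`rbf_kernel`, `fit_krr`, `input_hessian`, `make_poisoned_data`), and all parameter values are assumptions introduced here.

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 l^2)) (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))

def fit_krr(X, y, length_scale, ridge):
    """Dual coefficients alpha = (K + ridge * I)^{-1} y."""
    K = rbf_kernel(X, X, length_scale)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def predict(x, X, alpha, length_scale):
    """Predictor f(x) = sum_i alpha_i k(x, x_i) at a single input x."""
    return float(rbf_kernel(x[None, :], X, length_scale)[0] @ alpha)

def input_hessian(x, X, alpha, length_scale):
    """Hessian of f with respect to the input x (closed form for the RBF kernel)."""
    diff = x[None, :] - X                                   # shape (n, d)
    w = alpha * rbf_kernel(x[None, :], X, length_scale)[0]  # shape (n,)
    outer = (diff * w[:, None]).T @ diff / length_scale**4
    return outer - w.sum() * np.eye(X.shape[1]) / length_scale**2

def make_poisoned_data(n_clean=40, n_poison=5, dim=3, offset=0.5, seed=0):
    """Synthetic clean data plus a tight poison cluster with flipped labels."""
    rng = np.random.default_rng(seed)
    X_clean = rng.normal(size=(n_clean, dim))
    y_clean = np.sign(X_clean[:, 0])
    # Poisons sit at a fixed offset from one clean point; a small offset
    # mimics the near-clone setting described in the abstract.
    centre = X_clean[0] + offset * np.eye(dim)[1]
    X_poison = centre + 0.01 * rng.normal(size=(n_poison, dim))
    y_poison = -y_clean[0] * np.ones(n_poison)
    return np.vstack([X_clean, X_poison]), np.concatenate([y_clean, y_poison]), centre

X, y, x_poison = make_poisoned_data()
alpha = fit_krr(X, y, length_scale=1.0, ridge=0.1)
H = input_hessian(x_poison, X, alpha, length_scale=1.0)
eigvals = np.linalg.eigvalsh(H)  # H is symmetric by construction

print("prediction at poison centre:", predict(x_poison, X, alpha, 1.0))
print("largest |eigenvalue| of input Hessian:", np.abs(eigvals).max())
```

Varying `offset` and `length_scale` in this toy setting is one way to explore how the prediction at the poison centre and the Hessian spectrum move relative to each other; the sketch makes no claim about reproducing the paper's quantitative results.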
