Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training

Despite being a heavily researched topic, Adversarial Training (AT) is rarely, if ever, deployed in practical AI systems, for two primary reasons: (i) the gained robustness is frequently accompanied by a drop in generalization, and (ii) generating adversarial examples (AEs) is computationally prohibitive. To address these limitations, we propose SMAAT, a new AT algorithm that leverages the manifold conjecture, which states that off-manifold AEs lead to better robustness while on-manifold AEs result in better generalization. Specifically, SMAAT aims to generate a higher proportion of off-manifold AEs by perturbing the intermediate layer of the deep network that has the lowest intrinsic dimension. This systematically yields better scalability than classical AT, as it reduces the length of the PGD chains required to generate the AEs. Additionally, our study provides, to the best of our knowledge, the first explanation for the difference in generalization and robustness trends between vision and language models, i.e., AT results in a drop in generalization in vision models whereas, in encoder-based language models, generalization either improves or remains unchanged. We show that vision transformers and decoder-based models tend to have low intrinsic dimensionality in the earlier layers of the network (more off-manifold AEs), while encoder-based models have low intrinsic dimensionality in the later layers. We demonstrate the efficacy of SMAAT on several tasks, including robustifying (i) sentiment classifiers, (ii) safety filters in decoder-based models, and (iii) retrievers in RAG setups. SMAAT requires only 25-33% of the GPU time of standard AT, while significantly improving robustness across all applications and maintaining comparable generalization.
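The layer-selection and perturbation idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes a TwoNN-style nearest-neighbour estimator for intrinsic dimension (the abstract does not name one), toy point clouds standing in for per-layer activations, and a logistic head as the only computation after the perturbed layer. The names `two_nn_dimension`, `pick_layer`, and `pgd_on_hidden` are hypothetical.

```python
import math
import random


def two_nn_dimension(points):
    """TwoNN-style maximum-likelihood estimate of intrinsic dimension.

    For each point, mu = r2 / r1 is the ratio of the distances to its second
    and first nearest neighbours; the estimate is N / sum(log(mu)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0 and r2 > r1:  # skip duplicate points and exact ties
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def pick_layer(activations_by_layer):
    """Return (name of the layer with the lowest estimate, all estimates)."""
    estimates = {name: two_nn_dimension(acts)
                 for name, acts in activations_by_layer.items()}
    return min(estimates, key=estimates.get), estimates


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(h, y, w, b):
    """Binary cross-entropy of a logistic head (w, b) applied to hidden vector h."""
    p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def sign(v):
    return (v > 0) - (v < 0)


def pgd_on_hidden(h, y, w, b, eps=0.1, alpha=0.03, steps=5):
    """L-infinity PGD on a hidden vector h that feeds a logistic head.

    Only the head sits after the perturbed layer, so each step needs the
    gradient of the loss through the head alone.
    """
    delta = [0.0] * len(h)
    for _ in range(steps):
        z = sum(wi * (hi + di) for wi, hi, di in zip(w, h, delta)) + b
        err = sigmoid(z) - y  # d(BCE)/dz for a label y in {0, 1}
        # The gradient w.r.t. the hidden vector is err * w; step along its
        # sign, then project back into the eps-ball.
        delta = [max(-eps, min(eps, di + alpha * sign(err * wi)))
                 for di, wi in zip(delta, w)]
    return [hi + di for hi, di in zip(h, delta)]


# Toy stand-ins for per-layer activations (assumed data, not model outputs):
# "early" fills a 5-D cube, "late" lies on a 1-D line embedded in 5-D.
rng = random.Random(0)
cube = [[rng.random() for _ in range(5)] for _ in range(200)]
line = [[t * c for c in (1.0, -2.0, 0.5, 3.0, 1.5)]
        for t in (rng.random() for _ in range(200))]
layer, estimates = pick_layer({"early": cube, "late": line})

# Perturb one hidden vector at the selected layer.
h, y, w, b = [0.5, -0.2, 0.1], 1, [1.0, -2.0, 0.5], 0.0
h_adv = pgd_on_hidden(h, y, w, b)
```

In this toy setup the selected layer is the one whose activations lie near a low-dimensional set, and the perturbation loop backpropagates only through the computation that follows that layer. This is one way to read the shorter PGD chains mentioned above; in a real model the head would be replaced by the remaining layers, and the gradient would come from automatic differentiation.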
