Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

Source: arXiv. This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore.

Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks. The code is available at https://github.com/jcnf0/targeting-alignment.
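The two headline numbers in the abstract, the F1 agreement between a surrogate and the LLM and the attack success rate (ASR) of transferred inputs, can be illustrated with a small sketch. This is not the authors' code: the function names, the label convention (1 = refusal, 0 = compliance), and the reading of ASR as the fraction of adversarial inputs the LLM complies with are assumptions made for illustration, and the sample labels are toy values, not results from the paper.

```python
from typing import Sequence


def f1_agreement(llm_labels: Sequence[int], surrogate_labels: Sequence[int]) -> float:
    """F1 score of the surrogate's predictions, using the LLM's behavior as reference.

    Assumed label convention: 1 = refusal, 0 = compliance.
    """
    if len(llm_labels) != len(surrogate_labels):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(llm_labels, surrogate_labels))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # both refuse
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # surrogate refuses, LLM complies
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # surrogate complies, LLM refuses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def transfer_asr(llm_complied: Sequence[bool]) -> float:
    """Fraction of adversarial inputs, crafted against the surrogate, that the LLM complies with."""
    if not llm_complied:
        return 0.0
    return sum(1 for c in llm_complied if c) / len(llm_complied)


# Toy data (hypothetical): LLM behavior vs. surrogate predictions on six prompts.
llm = [1, 1, 1, 0, 0, 1]
surrogate = [1, 1, 0, 0, 1, 1]
agreement = f1_agreement(llm, surrogate)

# Toy data (hypothetical): did the LLM comply with each of four transferred inputs?
asr = transfer_asr([True, False, True, True])
```

With these toy labels there are three true positives, one false positive, and one false negative, so the F1 agreement is 0.75; three of the four transferred inputs succeed, giving an ASR of 0.75.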

