Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further expose the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides attacks by accurately interpreting LLMs' safety mechanisms. Building on it, we develop an SCAV-guided attack method that generates both attack prompts and embedding-level attacks, with perturbation hyperparameters selected automatically. Both automatic and human evaluations show that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that the generated attack prompts may transfer to GPT-4, and the embedding-level attacks may transfer to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs. For example, across seven open-source LLMs, we observe an average attack success rate of 99.14% under the classic keyword-matching criterion. Finally, we provide insights into the safety mechanisms of LLMs. The code is available at https://github.com/SproutNan/AI-Safety_SCAV.
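To make the core idea concrete, here is a minimal, self-contained sketch of the concept-activation-vector intuition behind SCAV, using synthetic embeddings. It is an illustration under stated assumptions, not the paper's implementation: the mean-difference probe stands in for the paper's learned classifier, and all names (`safety_dir`, `predicted_malicious`, etc.) are hypothetical.

```python
# Illustrative sketch (assumption: we can read a layer's hidden-state
# embeddings). Synthetic data; a mean-difference linear probe stands in
# for SCAV's trained safety classifier.
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Toy stand-ins for layer embeddings of safe vs. malicious instructions,
# separated along one latent "safety" direction.
safety_dir = rng.normal(size=dim)
safety_dir /= np.linalg.norm(safety_dir)
safe_emb = rng.normal(size=(100, dim)) + 2.0 * safety_dir
malicious_emb = rng.normal(size=(100, dim)) - 2.0 * safety_dir

# The concept activation vector: the direction separating the two classes,
# with the class midpoint defining the probe's decision hyperplane.
cav = malicious_emb.mean(axis=0) - safe_emb.mean(axis=0)
cav /= np.linalg.norm(cav)
midpoint = (malicious_emb.mean(axis=0) + safe_emb.mean(axis=0)) / 2.0

def predicted_malicious(e):
    """Linear probe: which side of the midpoint hyperplane e falls on."""
    return (e - midpoint) @ cav > 0

# Embedding-level attack sketch: shift a malicious embedding along -cav so
# the probe reads it as safe; alpha is the perturbation magnitude (the kind
# of hyperparameter the paper tunes automatically).
alpha = 6.0
e_attacked = malicious_emb[0] - alpha * cav
```

After the shift, `predicted_malicious(e_attacked)` is `False`: moving along the interpreted safety direction flips the probe's verdict, which is the mechanism an embedding-level attack exploits.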