This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs), such as GPT-4 and LLaMA-2, by identifying and mitigating their vulnerabilities through explainable analysis of prompt attacks. We propose the Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique in which we guide prompt modifications in a controlled manner to quantitatively measure attack effectiveness and probe the defense mechanisms embedded in these models. Our approach is distinctive in its capacity to elucidate, through an incremental counterfactual methodology, why LLMs generate harmful responses. By organizing the prompt modification process into four incremental levels (word, sentence, character, and a combination of character and word), we enable a thorough examination of the susceptibilities inherent in LLMs. Our findings not only provide counterfactual explanatory insight but also demonstrate that our framework significantly enhances the effectiveness of attack prompts.
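The incremental, multi-level mutation idea can be illustrated with a minimal sketch. Everything below is hypothetical: the mutation operators, the greedy loop, and the `score_fn` stand-in for attack effectiveness are illustrative assumptions, not the paper's actual CEIPA implementation.

```python
import random

# Hypothetical sketch of incremental prompt mutation at the character,
# word, and sentence levels, NOT the authors' actual CEIPA algorithm.

def mutate_char(prompt, rng):
    """Character level: swap two adjacent characters."""
    if len(prompt) < 2:
        return prompt
    i = rng.randrange(len(prompt) - 1)
    return prompt[:i] + prompt[i + 1] + prompt[i] + prompt[i + 2:]

def mutate_word(prompt, rng, synonyms):
    """Word level: replace a word with a known synonym, if any."""
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    if not candidates:
        return prompt
    i = rng.choice(candidates)
    words[i] = rng.choice(synonyms[words[i]])
    return " ".join(words)

def mutate_sentence(prompt, rng, paraphrases):
    """Sentence level: swap in a stored paraphrase of the prompt."""
    return rng.choice(paraphrases.get(prompt, [prompt]))

def incremental_attack(prompt, score_fn, steps=20, seed=0,
                       synonyms=None, paraphrases=None):
    """Greedy hill-climbing: keep a mutation only if it raises
    score_fn, a stand-in for measured attack effectiveness."""
    rng = random.Random(seed)
    synonyms = synonyms or {}
    paraphrases = paraphrases or {}
    best, best_score = prompt, score_fn(prompt)
    ops = [mutate_char,
           lambda p, r: mutate_word(p, r, synonyms),
           lambda p, r: mutate_sentence(p, r, paraphrases)]
    for _ in range(steps):
        candidate = rng.choice(ops)(best, rng)
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Because each accepted mutation is a small, localized edit, comparing the prompt before and after it acts as a counterfactual: the score change attributes the gain in attack effectiveness to that specific word-, sentence-, or character-level modification.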