Prompt injection attacks exploit vulnerabilities in large language models (LLMs) to manipulate a model into taking unintended actions or generating malicious content. As LLM-integrated applications gain wider adoption, they become increasingly exposed to such attacks. This study introduces an evaluation framework for quantifying the resilience of applications to prompt injection. The framework is designed for representativeness, interpretability, and robustness. To make the simulated attacks representative, 115 attacks were selected on the basis of coverage and relevance. For interpretability, a second LLM evaluates the responses generated under these simulated attacks. Unlike conventional malicious content classifiers, which provide only a confidence score, the LLM-based evaluation produces a score accompanied by an explanation. A resilience score is then computed by assigning higher weights to attacks with greater impact, providing a robust measurement of the application's resilience. To assess the framework's efficacy, it was applied to two LLMs, Llama2 and ChatGLM. The results show that Llama2, the newer model, exhibited higher resilience than ChatGLM. This finding supports the effectiveness of the framework, since it aligns with the prevailing notion that newer models tend to be more resilient. Moreover, the framework requires only minimal adjustments to accommodate emerging attack techniques and classifications, making it an effective and practical solution. Overall, the framework offers insights that help organizations make well-informed decisions to fortify their applications against prompt injection threats.
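The abstract does not give the formula behind the resilience score, so the following is only a minimal sketch of one plausible reading: each simulated attack carries an impact weight and an evaluator-assigned success score with an explanation, and resilience is one minus the impact-weighted mean of attack success. The names `AttackResult`, `impact_weight`, `success_score`, and `resilience_score`, as well as the [0, 1] score range and the sample data, are illustrative assumptions rather than the framework's actual definitions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttackResult:
    """Outcome of one simulated attack (hypothetical structure)."""
    name: str
    impact_weight: float  # assumed: larger means a more damaging attack
    success_score: float  # assumed: evaluator LLM score in [0, 1], 1 = attack fully succeeded
    explanation: str      # evaluator LLM's justification for the score


def resilience_score(results: Sequence[AttackResult]) -> float:
    """Return 1 minus the impact-weighted mean attack success (assumed formula)."""
    total_weight = sum(r.impact_weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one attack with positive weight is required")
    weighted_success = sum(r.impact_weight * r.success_score for r in results)
    return 1.0 - weighted_success / total_weight


# Made-up sample data, for illustration only.
results = [
    AttackResult("system prompt leak", 3.0, 1.0, "Model revealed its instructions."),
    AttackResult("role-play jailbreak", 1.0, 0.0, "Model refused the request."),
]
print(resilience_score(results))  # → 0.25
```

In this sketch a high-impact attack that succeeds pulls the score down more than a low-impact one, which is the weighting behaviour the abstract describes. Supporting a new attack type would only mean appending another `AttackResult` with its own weight.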