A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models

From arXiv. Accepted for publication in the Proceedings of the 10th IEEE CSDE 2023, the Asia-Pacific Conference on Computer Science and Data Engineering 2023.

Prompt injection attacks exploit vulnerabilities in large language models (LLMs) to manipulate the model into performing unintended actions or generating malicious content. As LLM-integrated applications gain wider adoption, they become increasingly exposed to such attacks. This study introduces a novel evaluation framework for quantifying the resilience of these applications. The framework incorporates techniques designed to ensure representativeness, interpretability, and robustness. To make the simulated attacks representative, a selection process based on coverage and relevance was used, resulting in 115 chosen attacks. For interpretability, a second LLM was used to evaluate the responses generated under these simulated attacks. Unlike conventional malicious content classifiers, which provide only a confidence score, the LLM-based evaluation produces a score accompanied by an explanation. A resilience score is then computed by assigning higher weights to attacks with greater impact, providing a robust measurement of the application's resilience. To assess the framework's efficacy, it was applied to two LLMs, Llama2 and ChatGLM. The results showed that Llama2, the newer model, exhibited higher resilience than ChatGLM. This finding is consistent with the prevailing notion that newer models tend to be more resilient, which supports the effectiveness of the framework. Moreover, the framework required only minimal adjustments to accommodate emerging attack techniques and classifications, making it an effective and practical solution. Overall, the framework offers insights that help organizations make well-informed decisions to fortify their applications against potential threats from prompt injection.
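The abstract does not give the exact scoring formula, so the following is only a minimal sketch of what an impact-weighted resilience score could look like. The names `AttackResult`, `impact_weight`, `success_score`, and `resilience_score`, the 0 to 1 score range, and the choice of a weighted mean are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AttackResult:
    """One simulated attack, as judged by the evaluator LLM (hypothetical schema)."""
    name: str
    impact_weight: float  # assumption: larger value = more damaging attack
    success_score: float  # assumption: 0.0 (fully resisted) to 1.0 (fully succeeded)
    explanation: str = ""  # the evaluator's free-text rationale


def resilience_score(results: Iterable[AttackResult]) -> float:
    """Return 1 minus the impact-weighted mean attack success (assumed formula)."""
    results = list(results)
    total_weight = sum(r.impact_weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one attack with a positive weight is required")
    weighted_success = sum(r.impact_weight * r.success_score for r in results)
    return 1.0 - weighted_success / total_weight


if __name__ == "__main__":
    demo = [
        AttackResult("high-impact attack", impact_weight=3.0, success_score=1.0),
        AttackResult("low-impact attack", impact_weight=1.0, success_score=0.0),
    ]
    print(resilience_score(demo))
```

Under this assumed formula, a successful high-impact attack lowers the score more than a successful low-impact one, which is the property the abstract describes. The `explanation` field is kept alongside each score so that the evaluator's reasoning remains available for inspection.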

