While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third-party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by players of an online game called Tensor Trust. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. The attacks in our dataset contain a great deal of easily interpretable structure and shed light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, we show that some attack strategies from the dataset generalize to deployed LLM-based applications, even though those applications operate under very different constraints from the game. We release all data and source code at https://tensortrust.ai/paper