Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although prompt hacking is widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies in this area. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.