Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although prompt hacking is widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on it. To address this lacuna, we launch a global prompt hacking competition that allows free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the resulting dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.