The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intended use. However, a particular type of adversarial prompt, known as a jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from LLMs. In this paper, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts are increasingly shifting from public platforms to private ones, posing new challenges for LLM vendors in proactive detection. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend against jailbreak prompts in all scenarios. In particular, we identify two highly effective jailbreak prompts that achieve attack success rates of 0.99 on ChatGPT (GPT-3.5) and GPT-4 and that have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts. We hope our study can help the research community and LLM vendors promote safer and regulated LLMs.