As large language models (LLMs) rapidly become ubiquitous, understanding their security vulnerabilities is critical. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing on the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the settings in which each is feasible and effective. In particular, we examine three types of defenses: detection (perplexity-based), input preprocessing (paraphrase and retokenization), and adversarial training. We consider white-box and gray-box settings and discuss the robustness-performance trade-off for each defense. We find that the weakness of existing discrete optimizers for text, combined with the relatively high cost of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether filtering and preprocessing defenses are stronger in the LLM domain than they have been in computer vision.
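To make the perplexity-based detection idea concrete, the following is a minimal sketch, not the evaluated implementation. It assumes per-token log-probabilities for a prompt are already available from some language model. The names `perplexity` and `is_suspicious`, the threshold of 50, and the example probabilities are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def is_suspicious(token_log_probs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds the threshold (an assumed, tunable value)."""
    return perplexity(token_log_probs) > threshold


# Illustrative numbers only, not measurements from any model.
fluent_prompt = [math.log(0.25)] * 8       # every token has probability 0.25
optimized_suffix = [math.log(0.001)] * 8   # every token has probability 0.001

print(is_suspicious(fluent_prompt, threshold=50.0))
print(is_suspicious(fluent_prompt + optimized_suffix, threshold=50.0))
```

In this toy setting the fluent prompt alone has low perplexity, while appending a low-probability suffix raises the average enough to cross the threshold. Choosing that threshold is where the robustness-performance trade-off appears: a lower value catches more optimized prompts but also rejects more benign ones.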