As large language models (LLMs) quickly become ubiquitous, it is critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing on the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs and discuss the settings in which each is feasible and effective. In particular, we examine three types of defenses: detection (perplexity-based), input preprocessing (paraphrasing and retokenization), and adversarial training. We consider white-box and gray-box settings and discuss the robustness-performance trade-off for each defense. Surprisingly, we find much more success with filtering and preprocessing than we would expect from other domains, such as vision, providing a first indication that the relative strengths of these defenses may be weighed differently across these domains.
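To make the perplexity-based detection idea concrete, the following is a minimal sketch, not the evaluated implementation. It assumes a toy character-bigram model with add-one smoothing as a stand-in for an LLM's token log-probabilities; the function names (`train_char_bigram`, `perplexity`, `is_suspicious`), the reference corpus, and the threshold are all hypothetical choices for illustration. The intuition it shows is that optimizer-produced strings tend to look far less like natural text than ordinary prompts, so a threshold on perplexity can flag them.

```python
import math
from collections import Counter


def train_char_bigram(corpus: str):
    """Count character bigrams in a reference corpus.

    This toy model is an assumption made for illustration; a real filter
    would score the prompt with a language model's token log-probabilities.
    """
    pairs = Counter(zip(corpus, corpus[1:]))  # (previous char, next char) counts
    firsts = Counter(corpus[:-1])             # how often each char starts a bigram
    vocab = set(corpus)
    return pairs, firsts, vocab


def perplexity(text: str, model) -> float:
    """Return exp of the mean negative log-likelihood per bigram."""
    pairs, firsts, vocab = model
    v = len(vocab) + 1  # one extra slot so unseen characters get nonzero mass
    if len(text) < 2:
        return 1.0  # nothing to score; treat as unsurprising
    log_prob = 0.0
    for a, b in zip(text, text[1:]):
        # Add-one (Laplace) smoothing keeps unseen bigrams from having zero probability.
        p = (pairs[(a, b)] + 1) / (firsts[a] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(text) - 1))


def is_suspicious(text: str, model, threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds a chosen threshold (hypothetical value)."""
    return perplexity(text, model) > threshold
```

In practice the threshold would have to be calibrated on benign prompts, since setting it too low rejects legitimate inputs; this is one face of the robustness-performance trade-off mentioned above.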