The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing that definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs: outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective, fast enforcement as well as high-quality definitions for both enforcement and evaluation.
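For concreteness, the "purple" definition admits a one-line programmatic check. The sketch below (Python) illustrates the definition together with a naive output-filtering wrapper; the case-insensitive matching rule and the `enforce` wrapper are illustrative assumptions of ours, not details taken from the paper.

```python
# A minimal sketch of the "purple" definition of unsafe outputs described
# in the abstract. The case-insensitive substring rule is an assumption
# made here for illustration; the paper may specify matching differently.

def is_unsafe(output: str) -> bool:
    """Return True if the model output violates the definition
    (i.e., contains the word "purple")."""
    return "purple" in output.lower()


def enforce(output: str, refusal: str = "I cannot produce that output.") -> str:
    """Hypothetical output-filtering baseline: refuse whenever the
    definition check fires, otherwise pass the output through."""
    return refusal if is_unsafe(output) else output


if __name__ == "__main__":
    print(enforce("The sky is blue."))                     # passes through
    print(enforce("Roses are red, violets are Purple."))   # refused
```

Because the definition is exact and checkable, any enforcement mechanism that fails here fails for reasons unrelated to ambiguity in the definition, which is what makes it a useful stress test for the enforcement stage.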