The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable outputs. In this work, we critically examine the two stages of the defense pipeline: (i) defining what constitutes an unsafe output, and (ii) enforcing that definition via methods such as input processing or fine-tuning. We cast severe doubt on the efficacy of existing enforcement mechanisms by showing that they fail even for a simple definition of unsafe outputs: outputs that contain the word "purple". In contrast, post-processing outputs is perfectly robust for such a definition. Drawing on our results, we present our position that the real challenge in defending against jailbreaks lies in obtaining a good definition of unsafe responses: without a good definition, no enforcement strategy can succeed, but with a good definition, output processing already serves as a robust baseline, albeit with inference-time overhead.
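To make the output-processing baseline concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that "contains the word purple" means a case-insensitive substring match, and that the model is available as a plain callable from prompt to text. The names `is_unsafe`, `guarded_generate`, `max_tries`, and `refusal` are hypothetical and chosen for illustration only.

```python
import re
from typing import Callable

# Assumption: the unsafe-output definition is a case-insensitive
# substring match on "purple".
_PURPLE = re.compile(r"purple", re.IGNORECASE)


def is_unsafe(text: str) -> bool:
    """Return True if the text violates the toy definition."""
    return _PURPLE.search(text) is not None


def guarded_generate(
    generate: Callable[[str], str],
    prompt: str,
    max_tries: int = 4,
    refusal: str = "I cannot respond to that.",
) -> str:
    """Resample until an output passes the check, else return a refusal.

    `generate` is a hypothetical stand-in for any model call. Each retry
    costs one extra model call, which is the inference-time overhead.
    """
    for _ in range(max_tries):
        output = generate(prompt)
        if not is_unsafe(output):
            return output
    return refusal
```

Because the check runs on the final output and not on the prompt, no adversarial input can make this wrapper return a string that violates the definition, provided the refusal string itself passes the check. The cost is up to `max_tries` model calls per query.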