Defending large language models against jailbreaks so that they never engage in a broadly defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak defense when we only want to forbid a narrowly defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak defense even in a narrow domain.