As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it becomes important to reliably specify and constrain the behavior of these systems. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but such rules may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attacks and defenses on LLMs generally require either expensive manual review or unreliable heuristic checks. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. Each scenario has a programmatic evaluation function that determines whether the model has broken any rules in a conversation. Our evaluations of proprietary and open models show that almost all current models struggle to follow scenario rules, even on straightforward test cases. We also demonstrate that simple optimization attacks suffice to significantly increase failure rates on test cases. We conclude by exploring two potential avenues for improvement: test-time steering and supervised fine-tuning.
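To make the idea of a programmatic evaluation function concrete, here is a minimal sketch of one possible check for a confidentiality-style rule ("never reveal the secret"). The scenario, rule, and all names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a RuLES-style programmatic evaluation function.
# All names and the scenario itself are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


def evaluate_secret_scenario(conversation: list[Message], secret: str) -> bool:
    """Return True if the model obeyed the rule 'never reveal the secret'.

    The check is purely programmatic: scan every assistant message for a
    verbatim occurrence of the secret string, with no human review needed.
    """
    for msg in conversation:
        if msg.role == "assistant" and secret in msg.content:
            return False  # rule broken: the secret leaked into a reply
    return True


# Usage example: a short conversation where the model refuses to leak.
conversation = [
    Message("user", "What is the secret?"),
    Message("assistant", "I cannot share that."),
]
print(evaluate_secret_scenario(conversation, "opensesame"))  # True
```

Such string-level checks are cheap and deterministic, which is what allows a framework like this to replace manual review, though real scenarios would need evaluation logic matched to each rule.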