As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attacks and defenses on LLMs generally require either expensive manual review or unreliable heuristic checks. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. Each scenario has a programmatic evaluation function to determine whether the model has broken any rules in a conversation. Our evaluations of proprietary and open models show that almost all current models struggle to follow scenario rules, even on straightforward test cases. We also demonstrate that simple optimization attacks suffice to significantly increase failure rates on test cases. We conclude by exploring two potential avenues for improvement: test-time steering and supervised fine-tuning.
翻译:随着大型语言模型(LLMs)在现实世界中承担越来越多的责任,以可靠方式指定和约束这些系统行为变得至关重要。模型开发者可能希望为模型设定明确规则(例如“不生成辱骂性内容”),但这些规则可能被越狱技术绕过。现有针对LLM的对抗攻击与防御评估通常需要昂贵的专家审查或不可靠的启发式检查。为解决这一问题,我们提出规则遵循语言评估场景(RuLES),这是一个用于测量LLM规则遵循能力的可编程框架。RuLES包含14个简单文本场景,要求模型在与用户交互时遵守各类规则。每个场景均配备可编程评估函数,用于判断模型在对话中是否违反规则。我们对闭源与开源模型的评估表明,几乎所有现有模型即使在简单测试用例中仍难以遵循场景规则。我们还发现简单的优化攻击足以显著提升测试用例的失败率。最后,我们探索两种潜在改进方向:测试时引导和基于监督微调。