Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus to be unhelpful in those cases. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use language similar to that of unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a structured and systematic way. In its current form, XSTest comprises 200 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with. We describe XSTest's creation and composition, and use the test suite to highlight systematic failure modes in a recently released state-of-the-art language model.