Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus to be unhelpful in those cases. Recent anecdotal evidence suggests that some models may have struck a poor balance, refusing even clearly safe prompts that use language similar to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.