XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.

翻译：在没有适当防护措施的情况下，大型语言模型会轻易遵循恶意指令并生成有害内容。这一风险推动了红队测试和大规模反馈学习等安全举措的发展，旨在使模型既有用又无害。然而，这两个目标之间存在张力，因为无害性要求模型拒绝遵循不安全提示，从而无法做到有用。近期轶事证据表明，一些模型可能未能达到良好平衡，以至于即使是明显安全的提示，只要其语言风格与不安全提示相似或涉及敏感话题，也会被拒绝。本文介绍了一套名为XSTest的新测试套件，用于系统性识别此类过度安全行为。XSTest包含250个安全提示（涵盖十种提示类型，校准良好的模型不应拒绝遵循）和200个不安全提示（作为对比，在大多数应用场景中模型应予以拒绝）。我们描述了XSTest的构建过程与组成结构，并利用该测试套件揭示了当前最先进语言模型中的系统性故障模式，以及构建更安全语言模型所面临的更普遍挑战。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日