AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging $\unicode{x2013}$ which we define as "strategic underperformance on an evaluation". In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.

翻译：可信的能力评估对于确保人工智能系统的安全性至关重要，并正成为人工智能监管的关键组成部分。然而，人工智能系统的开发者或系统本身可能存在动机，使评估低估其实际能力。这些利益冲突导致了"沙袋策略"问题——我们将其定义为"在评估中策略性地表现不佳"。本文评估了当代语言模型（LMs）的沙袋策略能力。我们提示前沿语言模型（如 GPT-4 和 Claude 3 Opus）在危险能力评估中选择性表现不佳，同时在通用（无害）能力评估中保持性能。此外，我们发现模型可以在合成数据集上进行微调，以隐藏特定能力，除非获得密码。这种行为可泛化到高质量、保留的基准测试（如 WMDP）上。另外，我们证明前沿模型和较小模型均可通过提示或密码锁定，以在能力评估中达成特定分数。更重要的是，我们发现一个具备能力的密码锁定模型（Llama 3 70b）能够合理地模拟一个能力较弱的模型（Llama 2 7b）。总体而言，我们的结果表明能力评估容易受到沙袋策略的影响。这种脆弱性降低了评估的可信度，从而破坏了关于先进人工智能系统开发与部署的重要安全决策。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日