Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging $\unicode{x2013}$ which we define as "strategic underperformance on an evaluation". In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.
翻译:可信的能力评估对于确保人工智能系统的安全性至关重要,并正成为人工智能监管的关键组成部分。然而,人工智能系统的开发者或系统本身可能存在动机,使评估低估其实际能力。这些利益冲突导致了"沙袋策略"问题——我们将其定义为"在评估中策略性地表现不佳"。本文评估了当代语言模型(LMs)的沙袋策略能力。我们提示前沿语言模型(如 GPT-4 和 Claude 3 Opus)在危险能力评估中选择性表现不佳,同时在通用(无害)能力评估中保持性能。此外,我们发现模型可以在合成数据集上进行微调,以隐藏特定能力,除非获得密码。这种行为可泛化到高质量、保留的基准测试(如 WMDP)上。另外,我们证明前沿模型和较小模型均可通过提示或密码锁定,以在能力评估中达成特定分数。更重要的是,我们发现一个具备能力的密码锁定模型(Llama 3 70b)能够合理地模拟一个能力较弱的模型(Llama 2 7b)。总体而言,我们的结果表明能力评估容易受到沙袋策略的影响。这种脆弱性降低了评估的可信度,从而破坏了关于先进人工智能系统开发与部署的重要安全决策。