LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat is sandbagging: strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-source models can covertly sandbag against CoT monitoring zero-shot, without hints. However, they cannot yet do so reliably: when monitor-aware, they bypass the monitor 16-36% of the time, conditioned on sandbagging successfully. We qualitatively analyze the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.
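The bypass figure above is a conditional rate: among attempts where the model sandbagged successfully, the fraction that the CoT monitor did not flag. A minimal sketch of that arithmetic follows. The `Trial` record, its field names, and the toy data are illustrative assumptions, not the paper's code or results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    # Hypothetical per-item labels; the field names are assumptions for illustration.
    sandbagged: bool  # the model successfully underperformed on this item
    flagged: bool     # the CoT monitor flagged the reasoning as suspicious


def conditional_bypass_rate(trials: List[Trial]) -> Optional[float]:
    """Fraction of successful sandbagging attempts that the monitor did not flag."""
    successful = [t for t in trials if t.sandbagged]
    if not successful:
        return None  # the rate is undefined when no sandbagging attempt succeeded
    uncaught = sum(1 for t in successful if not t.flagged)
    return uncaught / len(successful)


# Toy data (made up): four successful sandbagging attempts, one of them uncaught,
# plus one failed attempt that is excluded from the denominator.
toy_trials = [
    Trial(sandbagged=True, flagged=False),
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=False, flagged=False),
]
rate = conditional_bypass_rate(toy_trials)
```

Conditioning on successful sandbagging keeps failed attempts out of the denominator, so the rate describes how often the monitor misses sandbagging that actually occurred.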

