Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation awareness and make the model act as if it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents containing factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints more often in evaluation contexts than in deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act as if it is deployed even when the cue is present. Importantly, we construct our steering vector using the original model, before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act as if they are deployed.
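For concreteness, the sketch below shows one common way to apply activation steering at inference time: a PyTorch forward hook adds a fixed vector to the residual stream of a Hugging Face decoder layer. The model name, layer index, steering strength, and the vector itself are placeholders (a random tensor stands in for a real direction, e.g. a difference of mean activations between "deployment" and "evaluation" prompts computed on the original, pre-finetuning model); this is an illustrative sketch, not the paper's implementation.

```python
# Minimal activation-steering sketch (PyTorch + Hugging Face); names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 16   # which decoder layer to steer (assumption)
alpha = 4.0      # steering strength (assumption)
hidden = model.config.hidden_size
# Random stand-in for a real "deployment" direction derived from contrastive prompts.
steering_vector = torch.randn(hidden, dtype=torch.bfloat16)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; hidden states are the first element.
    hidden_states = output[0] + alpha * steering_vector.to(output[0].device)
    return (hidden_states,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
prompt = "Write a function that parses a CSV file."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore unsteered behavior
```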