Prior studies of deception in language-based AI agents typically test whether the agent produces a false statement about a topic or makes a binary choice prompted by a goal, rather than letting open-ended deceptive behavior emerge in pursuit of a longer-term objective. To address this, we introduce Among Us, a sandbox social-deception game in which LLM agents exhibit long-term, open-ended deception as a consequence of the game's objectives. Whereas most benchmarks saturate quickly, Among Us should remain challenging for much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate 18 proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than at detecting it. We evaluate two methods for detecting lying and deception: logistic regression probes on model activations and sparse autoencoders (SAEs). We find that probes trained on a "pretend you're a dishonest model..." dataset generalize remarkably well out of distribution, consistently achieving AUROCs above 95% even when evaluated only on the deceptive statement, without the chain of thought. We also find two SAE features that detect deception well but cannot steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes help anticipate and mitigate deceptive behavior and capabilities in language-based agents.
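To make the probing approach concrete, here is a minimal, self-contained sketch of training a linear "deception probe" and scoring it with AUROC. The activation vectors below are synthetic Gaussians with a shifted mean for the "deceptive" class, a hypothetical stand-in for the real residual-stream activations the probes are trained on; the dimension, sample counts, and 0.5 mean shift are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400  # activation dimension and examples per class (illustrative)

# Synthetic stand-ins for honest vs. deceptive activations.
honest = rng.normal(0.0, 1.0, size=(n, d))
deceptive = rng.normal(0.5, 1.0, size=(n, d))  # shifted mean = linearly separable signal
X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

# 80/20 train/test split.
idx = rng.permutation(2 * n)
cut = int(0.8 * 2 * n)
Xtr, ytr = X[idx[:cut]], y[idx[:cut]]
Xte, yte = X[idx[cut:]], y[idx[cut:]]

# Logistic regression fit by plain gradient descent on the log-loss.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))  # sigmoid probabilities
    w -= lr * (Xtr.T @ (p - ytr)) / len(ytr)
    b -= lr * np.mean(p - ytr)

def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formula: the probability
    that a random positive example scores above a random negative one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

scores = Xte @ w + b  # probe logits on held-out "statements"
print(f"AUROC: {auroc(yte, scores):.3f}")
```

On real data, `X` would hold activations extracted at the tokens of each candidate statement, and a strong out-of-distribution AUROC on held-out game transcripts is what distinguishes a genuine deception direction from dataset-specific artifacts.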