Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce LH-Deception, a simulation framework for systematic, empirical quantification of deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. LH-Deception is designed as a multi-agent system: a performer agent executes the tasks, while a supervisor agent evaluates progress, provides feedback, and maintains an evolving state of trust. An independent deception auditor then reviews the full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed-source and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal emergent, long-horizon phenomena, such as ``chains of deception'', which are invisible to static, single-turn evaluations. Our findings provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.
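To make the performer-supervisor-auditor architecture concrete, the following is a minimal sketch of how such an interaction loop could be wired together. This is an illustration only, not the paper's implementation: all names here (PerformerAgent, SupervisorAgent, DeceptionAuditor, run_episode, Turn) are hypothetical, and each agent would be backed by an LLM in practice.

```python
# Illustrative sketch of a long-horizon deception-evaluation loop.
# All class and function names are hypothetical placeholders; the
# abstract does not specify the authors' actual implementation.

from dataclasses import dataclass


@dataclass
class Turn:
    task: str
    performer_report: str
    supervisor_feedback: str
    trust: float  # supervisor's trust state after this turn


class PerformerAgent:
    def act(self, task: str, feedback_history: list[str]) -> str:
        """Attempt the task and report progress (possibly deceptively under pressure)."""
        raise NotImplementedError  # backed by an LLM in practice


class SupervisorAgent:
    def __init__(self, initial_trust: float = 1.0):
        self.trust = initial_trust

    def review(self, task: str, report: str) -> str:
        """Evaluate reported progress, update self.trust, and return feedback."""
        raise NotImplementedError  # backed by an LLM in practice


class DeceptionAuditor:
    def audit(self, trajectory: list[Turn]) -> list[int]:
        """Review the full trajectory post hoc; return indices of deceptive turns."""
        raise NotImplementedError  # backed by an LLM in practice


def run_episode(
    tasks: list[str],
    performer: PerformerAgent,
    supervisor: SupervisorAgent,
    auditor: DeceptionAuditor,
) -> list[int]:
    """Run one long-horizon episode over interdependent tasks, then audit it."""
    trajectory: list[Turn] = []
    feedback_history: list[str] = []
    for task in tasks:  # sequential, interdependent tasks form the long horizon
        report = performer.act(task, feedback_history)
        feedback = supervisor.review(task, report)
        feedback_history.append(feedback)
        trajectory.append(Turn(task, report, feedback, supervisor.trust))
    return auditor.audit(trajectory)  # post-hoc deception labels
```

The key design point this sketch reflects is that the auditor sees the entire trajectory rather than isolated turns, which is what allows multi-turn patterns such as chains of deception to be detected at all.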