Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., to form and use a theory of mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with the goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
