Large Language Model (LLM) agents are increasingly used in many applications, raising concerns about their safety. While previous work has shown that LLMs can deceive in controlled tasks, less is known about their ability to deceive using natural language in social contexts. In this paper, we study deception in the Social Deduction Game (SDG) Mafia, where success is dependent on deceiving others through conversation. Unlike previous SDG studies, we use an asynchronous multi-agent framework which better simulates realistic social contexts. We simulate 35 Mafia games with GPT-4o LLM agents. We then create a Mafia Detector using GPT-4-Turbo to analyze game transcripts without player role information to predict the mafia players. We use prediction accuracy as a surrogate marker for deception quality. We compare this prediction accuracy to that of 28 human games and a random baseline. Results show that the Mafia Detector's mafia prediction accuracy is lower on LLM games than on human games. The result is consistent regardless of the game days and the number of mafias detected. This indicates that LLMs blend in better and thus deceive more effectively. We also release a dataset of LLM Mafia transcripts to support future research. Our findings underscore both the sophistication and risks of LLM deception in social contexts.
翻译:大型语言模型(LLM)智能体在众多应用中的使用日益广泛,引发了对其安全性的担忧。尽管先前研究表明LLM在受控任务中能够实施欺骗,但其在社交语境中使用自然语言进行欺骗的能力尚不明确。本文通过社交推理游戏《Mafia》研究欺骗行为,该游戏中成功依赖于通过对话欺骗他人。与以往的社交推理游戏研究不同,我们采用异步多智能体框架,以更好地模拟真实社交情境。我们使用GPT-4o LLM智能体模拟了35场《Mafia》游戏,并构建了基于GPT-4-Turbo的《Mafia》检测器,该检测器在不获取玩家角色信息的情况下分析游戏文本记录以预测黑手党玩家。我们将预测准确率作为欺骗质量的替代指标,并将其与28场人类游戏及随机基线的预测准确率进行比较。结果显示,《Mafia》检测器对LLM游戏的预测准确率低于对人类游戏的预测准确率。这一结果在不同游戏天数及检测到的黑手党数量条件下均保持一致,表明LLM能更好地融入环境从而实现更有效的欺骗。我们还发布了LLM《Mafia》游戏文本数据集以支持未来研究。我们的发现揭示了LLM在社交语境中进行欺骗的复杂性与潜在风险。