Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called $\textit{Hoodwinked}$, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players hold a natural-language discussion and then vote to banish one player from the game. We conduct experiments with agents controlled by GPT-3, GPT-3.5, and GPT-4 and find evidence of deception and lie detection capabilities. The killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. More advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. Secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. To evaluate the ability of AI agents to deceive humans, we make this game publicly available at https://hoodwinked.ai/.