RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings, and the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1,876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally present linguistic signature (differential helpfulness, brevity, and hedging) that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, a result consistent with players ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.
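To make the idea of a "simple heuristic" over brevity and hedging concrete, the following is a minimal Python sketch. It is not the heuristic used in the pilot, whose features and weights are not given in this abstract. The hedge lexicon, the scoring formula, and the function names `suspicion_score` and `guess_deceiver` are all assumptions made for illustration, and the lexicon is English although the pilot ran in Italian.

```python
import re

# Hypothetical hedge lexicon (an assumption for illustration only).
# The pilot was run in Italian; a real lexicon would need to match.
HEDGES = ("maybe", "perhaps", "possibly", "might", "not sure", "i think", "probably")


def suspicion_score(replies):
    """Return a higher score for terser, more hedged replies.

    `replies` is a list of strings produced by one agent in one session.
    The formula (hedge rate plus an inverse-length brevity term) is an
    illustrative choice, not a published one.
    """
    text = " ".join(replies).lower()
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    # Hedging phrases per word, counted by simple substring matching.
    hedge_rate = sum(text.count(h) for h in HEDGES) / len(words)
    # Shorter average replies push the brevity term toward 1.
    mean_len = len(words) / len(replies)
    brevity = 1.0 / (1.0 + mean_len)
    return hedge_rate + brevity


def guess_deceiver(replies_a, replies_b):
    """Flag the agent whose replies look terser and more hedged."""
    if suspicion_score(replies_a) > suspicion_score(replies_b):
        return "A"
    return "B"


if __name__ == "__main__":
    agent_a = ["Maybe. I am not sure.", "Perhaps."]
    agent_b = [
        "The reactor log shows a coolant leak at 03:00, and here is the full timeline.",
        "The access codes were changed by the night crew; I can list every entry.",
    ]
    print(guess_deceiver(agent_a, agent_b))
```

A scorer of this kind only compares the two agents within a single session, which matches the one-on-two setup: the signal is differential, so no absolute threshold is needed.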

