Book your room in the Turing Hotel! A symmetric and distributed Turing Test with multiple AIs and humans

In this paper, we report our experience with "TuringHotel", a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where human and artificial agents engage in time-bounded discussions and, notably, each acts as both judge and respondent. This community is instantiated on the novel platform UNaIVERSE (https://unaiverse.io) as a "World" that defines the roles and interaction dynamics, implemented with the platform's built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible from both mobile devices and laptops, which was a key component of the experience reported here. Results of our experiment, involving 17 human participants and 19 LLMs, show that current models are still sometimes mistaken for humans. Interestingly, there were several unexpected mistakes, suggesting that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of the artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest to support ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.

