We evaluated four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants held simultaneous 5-minute conversations with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans it was being compared to -- while the baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). These results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. They have implications for debates about what kind of intelligence Large Language Models (LLMs) exhibit, and about the social and economic impacts these systems are likely to have.