We evaluated GPT-4 in a public online Turing Test. The best-performing GPT-4 prompt passed in 41% of games, outperforming baselines set by ELIZA (27%) and GPT-3.5 (14%), but falling short of both chance (50%) and the baseline set by human participants (63%). Participants' decisions were based mainly on linguistic style (35%) and socio-emotional traits (27%), supporting the idea that intelligence is not sufficient to pass the Turing Test. Participants' demographics, including education and familiarity with LLMs, did not predict detection rate, suggesting that even those who understand these systems deeply and interact with them frequently may be susceptible to deception. Despite known limitations as a test of intelligence, we argue that the Turing Test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.