The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach that employs simulations of user interactions. Third, we conduct an interactive, large-scale human-subjects study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours first occur only after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.