LLMs can generate human-like dialogues, yet their ability to simulate early child-adult interactions remains largely unexplored. In this paper, we examined how effectively LLMs can capture the distinctive features of child-caregiver language in interaction, using both static and interactive benchmarking methods. We found that state-of-the-art LLMs such as Llama 3 and GPT-4o can approximate child-caregiver dialogues at the word and utterance level, but they struggle to reproduce the child's and caregiver's discursive patterns, exaggerate alignment, and fail to reach the level of diversity shown by humans. The broader goal of this work is to initiate the development of a comprehensive benchmark for LLMs in child-oriented applications.