The development and evaluation of Large Language Models (LLMs) have primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches human-level diversity in vocabulary choice, syntactic construction, and expression of meaning, which raises the question of whether the fundamentals of language generation have been fully addressed. Given the concerning surge in online content produced or aided by LLMs, this paper emphasizes the importance of examining how well language models preserve human linguistic richness. We propose a comprehensive framework for evaluating LLMs from several linguistic diversity perspectives, including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions and conduct an in-depth case study of syntactic diversity. Finally, we analyze how different development and deployment choices affect the linguistic diversity of LLM outputs.