Large Language Models (LLMs) are deployed in applications ranging from clinical assistance and legal support to question answering and education. Their success in specialized tasks has prompted the claim that they possess human-like linguistic capabilities, including compositional understanding and reasoning. Yet reverse-engineering such capabilities is bound by Moravec's Paradox, according to which skills that come easily to humans are hard to engineer. We systematically assess 7 state-of-the-art models on a novel benchmark. The models answered a series of comprehension questions, each prompted multiple times in two settings that permitted either one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n=26,680 datapoints, we found that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans; qualitatively, their answers exhibit distinctly non-human errors in language understanding. We interpret this evidence as indicating that, despite their usefulness in various tasks, current AI models fall short of understanding language the way humans do, and we argue that this shortfall may stem from their lack of a compositional operator for regulating grammatical and semantic information.
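To make the evaluation protocol concrete, the sketch below shows one way the repeated-prompting setup could be implemented: each comprehension question is posed several times in both reply settings, and per-question accuracy and answer stability are scored against a gold label. The function names, repeat count, and instruction wording are illustrative assumptions, not the authors' actual harness.

```python
# Minimal sketch of the repeated-prompting protocol, assuming two reply
# settings (one-word vs. open-length) and a gold label per question.
# `ask` stands in for any model-query function (prompt -> answer string);
# N_REPEATS and the instruction strings are assumptions for illustration.
from collections import Counter
from typing import Callable

N_REPEATS = 3  # assumed number of repeated prompts per question and setting

SETTINGS = {
    "one_word": "Answer with one word only.",
    "open": "Answer the question.",  # open-length replies permitted
}

def evaluate(ask: Callable[[str], str], questions: list[dict]) -> list[dict]:
    """Prompt each question repeatedly in both settings, then score
    accuracy against the gold label and stability across repetitions."""
    records = []
    for q in questions:  # q: {"id": ..., "text": ..., "question": ..., "gold": ...}
        for setting, instruction in SETTINGS.items():
            prompt = f"{q['text']}\n\n{q['question']} {instruction}"
            answers = [ask(prompt).strip().lower() for _ in range(N_REPEATS)]
            counts = Counter(answers)
            accuracy = counts.get(q["gold"], 0) / N_REPEATS
            # stability: fraction of runs agreeing with the modal answer
            stability = counts.most_common(1)[0][1] / N_REPEATS
            records.append({"id": q["id"], "setting": setting,
                            "accuracy": accuracy, "stability": stability})
    return records
```

Under this scheme, chance-level accuracy with low stability (the pattern the abstract reports for LLMs) would show up as per-question accuracies near the random baseline together with modal-answer agreement well below 1.0 across repetitions.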