Large Language Models (LLMs) are deployed in applications ranging from clinical assistance and legal support to question answering and education. Their success in specialized tasks has prompted the claim that they possess human-like linguistic capabilities, including compositional understanding and reasoning. Yet reverse-engineering such capabilities is bound by Moravec's Paradox, according to which skills that come easily to humans are hard to engineer. We systematically assess 7 state-of-the-art models on a novel benchmark. The models answered a series of comprehension questions, each prompted multiple times in two settings that permitted either one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n=26,680 datapoints, we found that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans; qualitatively, their answers exhibit distinctly non-human errors in language understanding. We interpret this evidence as indicating that, despite their usefulness in various tasks, current AI models fall short of understanding language the way humans do, and we argue that this shortfall may stem from their lack of a compositional operator for regulating grammatical and semantic information.
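To make the evaluation protocol concrete, the sketch below shows one way the repeated-prompting setup could be implemented: each comprehension question is posed several times in both reply settings, and per-question accuracy and answer stability are scored against a gold label. The function names, repeat count, and instruction wording are illustrative assumptions, not the authors' actual harness.

```python
# Minimal sketch of the repeated-prompting protocol, assuming two reply
# settings (one-word vs. open-length) and a gold label per question.
# `ask` stands in for any model-query function (prompt -> answer string);
# N_REPEATS and the instruction strings are assumptions for illustration.
from collections import Counter
from typing import Callable

N_REPEATS = 3  # assumed number of repeated prompts per question and setting

SETTINGS = {
    "one_word": "Answer with one word only.",
    "open": "Answer the question.",  # open-length replies permitted
}

def evaluate(ask: Callable[[str], str], questions: list[dict]) -> list[dict]:
    """Prompt each question repeatedly in both settings, then score
    accuracy against the gold label and stability across repetitions."""
    records = []
    for q in questions:  # q: {"id": ..., "text": ..., "question": ..., "gold": ...}
        for setting, instruction in SETTINGS.items():
            prompt = f"{q['text']}\n\n{q['question']} {instruction}"
            answers = [ask(prompt).strip().lower() for _ in range(N_REPEATS)]
            counts = Counter(answers)
            accuracy = counts.get(q["gold"], 0) / N_REPEATS
            # stability: fraction of runs agreeing with the modal answer
            stability = counts.most_common(1)[0][1] / N_REPEATS
            records.append({"id": q["id"], "setting": setting,
                            "accuracy": accuracy, "stability": stability})
    return records
```

Under this scheme, chance-level accuracy with low stability (the pattern the abstract reports for LLMs) would show up as per-question accuracies near the random baseline together with modal-answer agreement well below 1.0 across repetitions.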