Large language models (LLMs) that converse fluently with humans are now a reality, but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. Within a unified experimental framework, we collect sentence comprehension data from humans and from five families of state-of-the-art LLMs varying in size and training procedure. Our results show that LLMs struggle overall on the target structures, and especially on garden path (GP) sentences. Indeed, while the strongest models achieve near-perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when structures are ranked by average performance, the rank correlation between humans and models increases with parameter count. For each target structure, we also collect data on a matched baseline sentence that lacks the difficult structure. Comparing performance on target vs. baseline sentences, the performance gap observed in humans also holds for LLMs, with two exceptions: for models that are too weak, performance is uniformly low across both sentence types, and for models that are too strong, it is uniformly high. Together, these results reveal both convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.