Large language models (LLMs) that converse fluently with humans are now a reality, but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. Within a unified experimental framework, we collect sentence comprehension data from humans and from five families of state-of-the-art LLMs varying in size and training procedure. Our results show that LLMs struggle overall on the target structures, and especially on garden path (GP) sentences. Indeed, while the strongest models achieve near-perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when structures are ranked by average performance, the rank correlation between humans and models increases with parameter count. For each target structure, we also collect data on a matched baseline sentence that lacks the difficult structure. Comparing performance on target vs. baseline sentences, the performance gap observed in humans also holds for LLMs, with two exceptions: for models that are too weak, performance is uniformly low across both sentence types, and for models that are too strong, it is uniformly high. Together, these results reveal both convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.