Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance on some language tasks differs both quantitatively and qualitatively from that of humans; however, it remains to be determined whether such differences can be overcome by increasing model size. This work investigates the role of model scaling, asking whether increases in size close the gap between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. A total of n=1,200 judgments is collected and scored for accuracy, stability, and improvement in accuracy upon repeated presentation of a prompt. The results of the best-performing LLM, ChatGPT-4, are compared to the results of n=80 humans on the same stimuli. We find that increased model size may lead to better performance, but LLMs are still not as sensitive to (un)grammaticality as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
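The three scoring dimensions named above (accuracy, stability, and improvement in accuracy upon repeated presentation) can be made concrete with a minimal sketch. The `Item`, `judge`, and `score` names below are hypothetical illustrations written for this summary, not the study's actual code; the real prompts, response parsing, and statistics may differ.

```python
# Minimal sketch of the scoring protocol, assuming each prompt is presented
# k times and each response is reduced to a yes/no grammaticality judgment.
# All names here are hypothetical; this is not the authors' implementation.

from dataclasses import dataclass


@dataclass
class Item:
    sentence: str      # test sentence (e.g., anaphora, center embedding)
    grammatical: bool  # expert gold label


def judge(model_response: str) -> bool:
    """Map a free-text model response to a boolean judgment (assumed parser)."""
    return model_response.strip().lower().startswith("yes")


def score(items, responses_per_item):
    """Score k raw responses per item for accuracy, stability, improvement."""
    n_items = len(items)
    k = len(responses_per_item[0])
    # Accuracy: proportion of all judgments agreeing with the gold label.
    accuracy = sum(
        judge(r) == item.grammatical
        for item, resps in zip(items, responses_per_item)
        for r in resps
    ) / (n_items * k)
    # Stability: proportion of items where all k judgments agree with each other.
    stability = sum(
        len({judge(r) for r in resps}) == 1
        for resps in responses_per_item
    ) / n_items
    # Improvement: accuracy on the last presentation minus the first.
    first = sum(judge(resps[0]) == item.grammatical
                for item, resps in zip(items, responses_per_item)) / n_items
    last = sum(judge(resps[-1]) == item.grammatical
               for item, resps in zip(items, responses_per_item)) / n_items
    return {"accuracy": accuracy, "stability": stability,
            "improvement": last - first}


if __name__ == "__main__":
    # Toy demo with fabricated responses, purely to show the scoring logic.
    items = [Item("The key to the cabinets is missing.", True),
             Item("Mary wonders who John said that left.", False)]
    fake_responses = [["Yes", "Yes", "Yes"], ["Yes", "No", "No"]]
    print(score(items, fake_responses))
```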