We present a targeted, scaled-up comparison of incremental processing in humans and neural language models, based on by-word reaction time data collected for sixteen syntactic test suites that span a range of structural phenomena. The human reaction time data come from a novel online experimental paradigm called the Interpolated Maze task. We compare human reaction times to by-word probabilities from four contemporary language models that differ in architecture and in the size of their training data sets. We find that across many phenomena, both humans and language models show increased processing difficulty in ungrammatical sentence regions, and that human and model "accuracy" scores (à la Marvin and Linzen (2018)) are about equal. However, although language model outputs match human behavior in direction, we show that models systematically under-predict the magnitude of the difference in incremental processing difficulty between grammatical and ungrammatical sentences. Specifically, when models encounter syntactic violations, they fail to accurately predict the longer reaction times observed in the human data. These results call into question whether contemporary language models are approaching human-like sensitivity to syntactic violations.
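The contrast between matching in direction and matching in magnitude can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the probabilities and reaction times are invented toy values rather than data from the study, the function names are hypothetical, and the linear link of 20 ms per bit of surprisal is a placeholder, not the linking function used in the paper. An "accuracy" score here is the fraction of items on which the ungrammatical condition is harder than the grammatical one.

```python
import math


def surprisal(prob):
    """Surprisal in bits of a word with conditional probability `prob`."""
    return -math.log2(prob)


def accuracy(pairs):
    """Fraction of (grammatical, ungrammatical) pairs in which the
    ungrammatical value is larger, i.e. the violation is harder."""
    return sum(1 for gram, ungram in pairs if ungram > gram) / len(pairs)


def mean_effect(pairs):
    """Mean ungrammatical-minus-grammatical difference across pairs."""
    return sum(ungram - gram for gram, ungram in pairs) / len(pairs)


# Hypothetical toy data, one pair per test item:
# (value in grammatical condition, value in ungrammatical condition).
model_probs = [(0.20, 0.05), (0.10, 0.02), (0.30, 0.40), (0.25, 0.10)]
human_rts_ms = [(650.0, 900.0), (700.0, 1100.0), (720.0, 690.0), (680.0, 950.0)]

# Convert model probabilities to surprisal so that larger means harder.
model_surprisals = [(surprisal(g), surprisal(u)) for g, u in model_probs]

# Direction: do humans and the model agree on which condition is harder?
model_accuracy = accuracy(model_surprisals)
human_accuracy = accuracy(human_rts_ms)

# Magnitude: map the model's surprisal effect to milliseconds with an
# assumed linear link, then compare it with the observed human effect.
ASSUMED_MS_PER_BIT = 20.0  # placeholder slope, not taken from the paper
predicted_effect_ms = mean_effect(model_surprisals) * ASSUMED_MS_PER_BIT
observed_effect_ms = mean_effect(human_rts_ms)

print(f"model accuracy: {model_accuracy:.2f}")
print(f"human accuracy: {human_accuracy:.2f}")
print(f"predicted slowdown: {predicted_effect_ms:.1f} ms")
print(f"observed slowdown: {observed_effect_ms:.1f} ms")
```

With these toy values the two accuracy scores coincide, while the predicted slowdown is much smaller than the observed one. That is the qualitative pattern the abstract describes, although the size of the gap here is an artifact of the invented numbers and the assumed slope.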