There has been considerable interest in using surprisal from Transformer-based language models (LMs) as a predictor of human sentence processing difficulty. Recent work has observed an inverse scaling relationship between Transformers' per-word estimated probability and the predictive power of their surprisal estimates on reading times: LMs with more parameters and trained on more data are less predictive of human reading times. However, these studies focused on predicting latency-based measures. Tests on brain imaging data, which used a relatively small set of LMs, have not shown a trend in either direction, leaving open the possibility that the inverse scaling phenomenon is confined to latency data. This study therefore conducts a more comprehensive evaluation, using surprisal estimates from 17 pre-trained LMs across three LM families on two functional magnetic resonance imaging (fMRI) datasets. Results show that the inverse scaling relationship between models' per-word estimated probability and model fit obtains on both datasets, resolving the inconclusive results of previous work and indicating that this trend is not specific to latency-based measures.