Research on the cognitive plausibility of language models (LMs) has so far concentrated mostly on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while largely leaving aside developmental plausibility and what Mahowald et al. (2023) describe as formal and functional linguistic competence. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus and evaluating them on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with the tasks differing in their preference for model width versus depth. In contrast, we find a negative correlation between LM size and the reading time fit of linear mixed-effects models that use LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.
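The reading time comparison described above can be illustrated with a minimal sketch. It is not the evaluation pipeline used in this work: it swaps the linear mixed-effects models for a plain least-squares regression without random effects, and the function names (`surprisal_bits`, `gaussian_loglik`, `delta_loglik`) and the toy numbers are hypothetical. It only shows how a surprisal predictor is compared against a baseline without surprisal through the difference in log-likelihood.

```python
import math


def surprisal_bits(prob):
    """Surprisal of a word in bits: -log2 P(word | context)."""
    return -math.log2(prob)


def gaussian_loglik(xs, ys):
    """Fit y = a + b*x by least squares and return the maximised
    Gaussian log-likelihood (ML variance estimate RSS / n).

    Assumption: ordinary least squares stands in for the linear
    mixed-effects models named in the text (no random effects here).
    """
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # constant predictor: intercept only
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def delta_loglik(surprisals, reading_times):
    """Log-likelihood of the surprisal model minus that of an
    intercept-only baseline fitted to the same reading times."""
    baseline = gaussian_loglik([0.0] * len(reading_times), reading_times)
    full = gaussian_loglik(surprisals, reading_times)
    return full - baseline


if __name__ == "__main__":
    # Hypothetical per-word LM probabilities and reading times in ms.
    probs = [0.5, 0.25, 0.125, 0.0625]
    rts = [210.0, 250.0, 270.0, 330.0]
    surprisals = [surprisal_bits(p) for p in probs]
    print(delta_loglik(surprisals, rts))
```

Under this sketch, a larger difference means that surprisal from a given LM explains more of the variation in reading times beyond the baseline; repeating the comparison for each LM size would yield the kind of ranking summarised above.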