How well do language models deal with quantification? In this study, we focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models because the sentence components without the quantifier are likely to co-occur, and 'few'-type quantifiers are rare. We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do all the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. This inverse scaling is consistent with previous work suggesting that larger models increasingly reflect online rather than offline human processing, and we argue that the decreasing performance of larger models may challenge uses of language models as the basis for natural language systems.