As they grow in size, large language models (LLMs) are becoming increasingly good at language understanding tasks. But even with high performance on specific downstream tasks, LLMs fail at simple linguistic tests for negation or quantifier understanding. Previous work on quantifier understanding in LLMs reports inverse scaling in the understanding of few-type quantifiers. In this paper, we question the claims of that previous work and show that the reported inverse scaling is a result of inappropriate testing methodology. We also present alternative methods to measure quantifier comprehension in LLMs and show that, as their size increases, LLMs better understand the difference in meaning between few-type and most-type quantifiers, although they are not particularly good at it. We also observe inverse scaling for most-type quantifier understanding, where the model's understanding of most-type quantifiers gets worse as model size increases; this is contrary to human psycholinguistic experiments and previous work. We run this evaluation on models ranging from 125M to 175B parameters, and the results suggest that LLMs do not do as well as expected with quantifiers. We also discuss possible reasons for this and the relevance of quantifier understanding in evaluating language understanding in LLMs.