As they grow in size, large language models (LLMs) are becoming increasingly good at language understanding tasks. But even with high performance on specific downstream tasks, LLMs fail at simple linguistic tests of negation or quantifier understanding. Previous work on quantifier understanding in LLMs reports inverse scaling in the understanding of few-type quantifiers. In this paper, we question the claims of that previous work and show that the reported inverse scaling is a result of inappropriate testing methodology. We also present alternative methods to measure quantifier comprehension in LLMs and show that LLMs become better at distinguishing the meanings of few-type and most-type quantifiers as their size increases, although they are not particularly good at it. We also observe inverse scaling for most-type quantifier understanding, in which the model's understanding of most-type quantifiers gets worse as model size increases; this is contrary to human psycholinguistic experiments and previous work. We carry out this evaluation on models ranging from 125M to 175B parameters, and the results suggest that LLMs do not do as well as expected with quantifiers. We also discuss possible reasons for this and the relevance of quantifier understanding to evaluating language understanding in LLMs.