Probing Quantifier Comprehension in Large Language Models

As they grow in size, large language models (LLMs) are becoming increasingly good at language understanding tasks. But even with high performance on specific downstream tasks, LLMs fail at simple linguistic tests for negation or quantifier understanding. Previous work testing the ability of LLMs to understand quantifiers suggests that as model size increases, they get better at understanding most-type quantifiers but increasingly worse at understanding few-type quantifiers, thus presenting a case of an inverse scaling law. In this paper, we question the claims of inverse scaling of few-type quantifier understanding in LLMs and show that it is a result of inappropriate testing methodology. We also present alternate methods to measure quantifier comprehension in LLMs and show that, as model size increases, the observed behaviours differ from those reported in previous research. LLMs are consistently able to understand the difference in meaning between few-type and most-type quantifiers, but when a quantifier is added to a phrase, LLMs do not always take the meaning of the quantifier into account. We in fact see an inverse scaling law for most-type quantifiers, where the model's understanding of most-type quantifiers gets worse as model size increases, which is contrary to human psycho-linguistic experiments and previous work. We perform this evaluation on models ranging from 125M to 175B parameters; the results suggest that LLMs do not do as well as expected with quantifiers and that statistical co-occurrence of words still takes precedence over word meaning.
