Quantification has proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models ((M)LLMs). However, because quantification interfaces with the logical, pragmatic, and numerical domains, the exact reasons for this poor performance remain unclear. This paper examines three key features of human quantification that are shared cross-linguistically and have so far remained unexplored in the (M)LLM literature: the ordering of quantifiers into scales, their ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models' architecture, how they differ from their human counterparts, and whether the results are affected by the type of model (thinking vs. instruct) and the language under investigation. Results show that, although thinking models achieve high accuracy in the numerosity estimation task and in the organization of quantifiers into scales, key differences between humans and (M)LLMs persist across all model types, particularly in ranges of use and prototypicality values. This work thus paves the way for addressing the nature of (M)LLMs as semantic and pragmatic agents, while its cross-linguistic lens can elucidate whether their abilities are robust and stable across languages.