From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

The staggering pace at which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions about what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are trained exclusively on text, casting doubt on whether their stellar benchmark performance reflects a true understanding of the problems these benchmarks represent, or whether LLMs simply excel at producing textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning with a series of tests that leverage the idea that world understanding should be consistent across presentational modes of the same meaning, a notion inspired by Fregean senses. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this respect, the understanding of LLMs is still quite far from being consistent and human-like, and discuss how this impacts their utility in the context of learning about human language and understanding.
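
To make the notion of multisense consistency concrete, the following is a minimal sketch of one way such a score could be computed: the fraction of items for which a model gives the same answer under every sense (for example, the same task posed in different languages or as a paraphrase). The function name, the data layout, and the exact-match agreement criterion are assumptions made for illustration; the abstract does not specify the metric used in the evaluation, and the labels below are hand-written placeholders, not results.

```python
from typing import Hashable, Mapping, Sequence


def multisense_consistency(
    answers_by_sense: Mapping[str, Sequence[Hashable]],
) -> float:
    """Return the fraction of items answered identically under every sense.

    `answers_by_sense` maps a sense name (hypothetical keys such as "en" or
    "paraphrase") to the model's answers, aligned by item index.
    """
    senses = list(answers_by_sense.values())
    if not senses:
        raise ValueError("need at least one sense")
    n_items = len(senses[0])
    if any(len(s) != n_items for s in senses):
        raise ValueError("all senses must cover the same items")
    if n_items == 0:
        raise ValueError("need at least one item")
    # zip(*senses) yields, for each item, the tuple of answers across senses.
    agreeing = sum(
        1 for item_answers in zip(*senses) if len(set(item_answers)) == 1
    )
    return agreeing / n_items


# Hypothetical, hand-written answers for four items (not data from the paper).
answers = {
    "en": ["yes", "no", "yes", "no"],
    "de": ["yes", "no", "no", "no"],
    "paraphrase": ["yes", "yes", "no", "no"],
}
score = multisense_consistency(answers)
print(f"multisense consistency: {score:.2f}")
```

Note that a score of this kind is independent of accuracy: a model can be consistently wrong or inconsistently right, which is why consistency across senses can be examined separately from benchmark performance.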
