Embodied language comprehension holds that language understanding is not solely a matter of mental processing in the brain but also involves interaction with the physical and social environment. With the explosive growth of Large Language Models (LLMs) and their already ubiquitous presence in daily life, it is becoming increasingly necessary to verify their real-world understanding. Inspired by cognitive theories, we propose POSQA: a Physical Object Size Question Answering dataset of simple size comparison questions, designed to probe the limits of the latest LLMs' embodied comprehension and to analyze its potential mechanisms. We show that even the largest LLMs today perform poorly in the zero-shot setting. We then push their limits with advanced prompting techniques and external knowledge augmentation. Furthermore, we investigate whether their real-world comprehension derives primarily from contextual information or from internal weights, and we analyze the impact of prompt format and of reporting bias across different objects. Our results show that the real-world understanding LLMs acquire from textual data is vulnerable to deception and confusion by the surface form of prompts, which makes it less aligned with human behavior.