Language models~(LMs) are gradually becoming general-purpose interfaces in the interactive and embodied world, where understanding physical concepts is an essential prerequisite. However, it is not yet clear whether LMs can understand physical concepts in the human world. To investigate this, we design a benchmark, VEC, that covers tasks on (i) Visual Concepts, such as the shape and material of objects, and (ii) Embodied Concepts, which are learned through interaction with the world, such as the temperature of objects. Our zero-shot and few-shot prompting results show that an understanding of certain visual concepts emerges as LMs scale up, but there are still basic concepts to which the scaling law does not apply. For example, OPT-175B performs close to humans on the material concept, with a zero-shot accuracy of 85\%, yet behaves like random guessing on the mass concept. In contrast, vision-augmented LMs such as CLIP and BLIP achieve a human-level understanding of embodied concepts. Our analysis indicates that the rich semantics in visual representations can serve as a valuable source of embodied knowledge. Inspired by this, we propose a distillation method that transfers embodied knowledge from vision-language models~(VLMs) to LMs, achieving a performance gain comparable to that obtained by scaling up the parameters of LMs by 134x. Our dataset is available at \url{https://github.com/TobiasLee/VEC}.