This paper presents GRASP, a novel benchmark for evaluating the language grounding and physical understanding capabilities of video-based multimodal large language models (LLMs). The evaluation follows a two-level approach built on Unity simulations. The first level tests language grounding by assessing a model's ability to relate simple textual descriptions to visual information. The second level evaluates the model's understanding of "intuitive physics" principles, such as object permanence and continuity. In addition to releasing the benchmark, we use it to evaluate several state-of-the-art multimodal LLMs. Our evaluation reveals significant shortcomings in both the language grounding and intuitive physics capabilities of these models. Although they exhibit at least some grounding ability, particularly for colors and shapes, this ability depends heavily on the prompting strategy. At the same time, all models perform at or below the 50% chance level on the intuitive physics tests, whereas human subjects answer 80% of the questions correctly on average. These limitations underscore the importance of benchmarks like GRASP for monitoring the progress of future models in developing these competencies.