This paper presents GRASP, a novel benchmark for evaluating the language grounding and physical understanding capabilities of video-based multimodal large language models (LLMs). The evaluation uses a two-level approach built on Unity simulations. The first level tests language grounding by assessing a model's ability to relate simple textual descriptions to visual information. The second level evaluates the model's understanding of intuitive physics principles, such as object permanence and continuity. In addition to releasing the benchmark, we use it to evaluate several state-of-the-art multimodal LLMs. Our evaluation reveals significant shortcomings in current models' language grounding and intuitive physics capabilities. These limitations underline the importance of benchmarks like GRASP for monitoring the progress of future models in developing these competencies.