MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using three domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain, especially in long-horizon multimodal tasks. We apply reinforcement learning to improve entity tracking in open-source VLMs. This yields substantial in-modality gains, but does not transfer robustly across input modalities. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.
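To make the task concrete, the following is a minimal sketch of what tracking entity states under a mixed-modality sequence of updates could look like. It is an illustration only, not the benchmark's actual data format or scoring code: the `Update` class, the `modality` label, the swap-style update, and the per-position accuracy metric are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Update:
    """One hypothetical state update: swap the contents of two positions.

    `modality` records how the update would be shown to a model
    ("text" or "image"); it does not change the ground-truth state.
    """
    modality: str
    a: int
    b: int


def apply_updates(initial: Sequence[str], updates: Sequence[Update]) -> List[str]:
    """Compute the ground-truth final state by applying every update in order."""
    state = list(initial)
    for u in updates:
        state[u.a], state[u.b] = state[u.b], state[u.a]
    return state


def position_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of positions where the predicted state matches the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold states must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# A toy episode: one update described in text, one shown as an image.
initial_state = ["ball", "empty", "empty"]
episode = [Update("text", 0, 1), Update("image", 1, 2)]
gold_state = apply_updates(initial_state, episode)
```

A model's predicted final state can then be scored against `gold_state` with `position_accuracy`, and comparing scores on episodes whose updates are all text against episodes whose updates are all images gives one simple way to look at a modality gap of the kind the abstract describes.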
