Entity state tracking, which requires maintaining coherent representations of entities over time, is a necessary component of world modeling. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate vision-language models' ability to track entity states across modalities. Using two structured domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking, and we show empirically that this gap stems primarily from deficits in visual reasoning rather than in perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain in long-horizon multimodal tasks. To address these limitations, we develop a reinforcement learning method that improves performance on MET-Bench; applied to open-source VLMs, it achieves performance competitive with advanced closed-source models. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.
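The abstract leaves the evaluation protocol implicit. As a minimal sketch of what tracking entity state across a sequence of updates involves, the following assumes, purely for illustration, a shell-game-style domain in which a hidden ball moves under swapped cups; the names `Update`, `apply_updates`, and `evaluate`, the stub image branch, and the oracle model are hypothetical and are not MET-Bench's actual API or data format.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Sequence

# Hypothetical shell-game-style domain: a ball hidden under one of several
# cups, where each update swaps two cup positions. Updates could arrive as
# text ("swap cups 0 and 1") or as an image rendering of the swap; the image
# case is only stubbed here, since the abstract does not specify rendering.

@dataclass
class Update:
    modality: Literal["text", "image"]
    swap: tuple[int, int]  # ground-truth effect of this update
    payload: str           # prompt text, or an image path/handle

def apply_updates(ball_pos: int, updates: Sequence[Update]) -> int:
    """Ground-truth tracker: follow the ball through every swap."""
    for u in updates:
        a, b = u.swap
        if ball_pos == a:
            ball_pos = b
        elif ball_pos == b:
            ball_pos = a
    return ball_pos

def evaluate(model: Callable[[int, Sequence[Update]], int],
             episodes: Sequence[tuple[int, Sequence[Update]]]) -> float:
    """Fraction of episodes where the model's final-state answer is correct."""
    correct = sum(
        model(start, updates) == apply_updates(start, updates)
        for start, updates in episodes
    )
    return correct / len(episodes)

if __name__ == "__main__":
    # A trivial oracle stands in for a VLM, just to show the harness runs.
    episodes = [
        (0, [Update("text", (0, 1), "swap cups 0 and 1"),
             Update("text", (1, 2), "swap cups 1 and 2")]),
    ]
    oracle = lambda start, ups: apply_updates(start, ups)
    print(evaluate(oracle, episodes))  # -> 1.0
```

In an actual benchmark run, the oracle would be replaced by a call to a vision-language model that receives the textual or rendered-image updates and must report the final entity state; the gap the abstract describes is between accuracy on text-only and image-based update sequences under such a harness.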