Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using three domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain, especially in long-horizon multimodal tasks. We apply reinforcement learning to improve entity tracking in open-source VLMs. This yields substantial in-modality gains, but does not transfer robustly across input modalities. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.