Multimodal language models (MLLMs) are increasingly deployed in real-world environments, where they must interpret 3D spaces and comprehend temporal dynamics. Despite their potential, even the strongest current models in our community fall short of adequately understanding spatial and temporal dimensions. We introduce Coarse Correspondence, a simple, training-free, effective, and general-purpose visual prompting method that elicits 3D and temporal understanding in MLLMs. Our method uses a lightweight tracking model to find object correspondences between frames of a video or between sets of image viewpoints. It selects the most frequently appearing object instances and visualizes each of them in the image with a marker carrying a unique ID. With this simple approach, we achieve state-of-the-art results on 3D understanding benchmarks, including ScanQA (+20.5\%) and a subset of OpenEQA (+9.7\%), and on long-form video benchmarks such as EgoSchema (+6.0\%). We also curate a small diagnostic dataset to evaluate whether MLLMs can reason about space from a described viewpoint other than the camera viewpoint. Coarse Correspondence again improves spatial perspective-taking, though we highlight that MLLMs continue to struggle with this task. Together, these results demonstrate that our simple prompting method can significantly aid downstream tasks that require 3D or temporal reasoning.
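To make the pipeline concrete, below is a minimal sketch of the three steps described above (track objects across frames, select the most frequent instances, overlay markers with unique IDs). It is an illustration under stated assumptions, not the paper's implementation: \texttt{run\_tracker} is a hypothetical stand-in for whichever lightweight tracking model is used, and the marker style (a red circle with a white numeric label) and the \texttt{top\_k} cutoff are illustrative choices.

\begin{verbatim}
from collections import Counter
from typing import Dict, List, Tuple

import cv2  # pip install opencv-python
import numpy as np

Point = Tuple[int, int]


def run_tracker(frames: List[np.ndarray]) -> List[Dict[int, Point]]:
    """Hypothetical hook: plug in any lightweight tracking model that
    assigns a persistent instance ID to each object and returns, per
    frame, a mapping from instance ID to the object's centroid (x, y)."""
    raise NotImplementedError


def coarse_correspondence(frames: List[np.ndarray],
                          top_k: int = 5) -> List[np.ndarray]:
    # Step 1: obtain object correspondences across frames / viewpoints.
    tracks = run_tracker(frames)

    # Step 2: count how often each instance ID appears across frames and
    # keep only the top_k most frequent instances.
    freq = Counter(obj_id for per_frame in tracks for obj_id in per_frame)
    keep = {obj_id for obj_id, _ in freq.most_common(top_k)}

    # Step 3: overlay a marker with the instance's unique ID on every
    # frame where a kept instance is visible, so the same object carries
    # the same label across views.
    marked = []
    for frame, per_frame in zip(frames, tracks):
        canvas = frame.copy()
        for obj_id, (x, y) in per_frame.items():
            if obj_id not in keep:
                continue
            cv2.circle(canvas, (x, y), 14, (0, 0, 255), thickness=-1)
            cv2.putText(canvas, str(obj_id), (x - 7, y + 6),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
        marked.append(canvas)
    return marked
\end{verbatim}

The marked frames are then passed to the MLLM in place of the raw frames; no model weights are changed, which is what makes the method training-free.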