Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLMs) have demonstrated remarkable performance on certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships between physical objects, for example distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in their training data, and we aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first Internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability in both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/