Recent advances in large vision-language models (LVLMs) have improved vision-language understanding, but these models still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground truth. This pipeline supports the creation of a diverse set of spatial tasks, ranging from basic perception to more complex reasoning. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. We also introduce SPAR-Bench, a benchmark that offers a more comprehensive evaluation of spatial capabilities than existing spatial benchmarks and supports both single-view and multi-view inputs. Training on SPAR-7M together with large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.