Recent advances in large vision-language models (LVLMs) have improved vision-language understanding, but these models still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground truth. This pipeline supports the creation of a diverse set of spatial tasks, ranging from basic perception to more complex reasoning. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. We also introduce SPAR-Bench, a benchmark that offers a more comprehensive evaluation of spatial capabilities than existing spatial benchmarks and supports both single-view and multi-view inputs. Training on SPAR-7M together with large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.