Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding it both RGB and depth images. Additionally, we construct the SpatialQA dataset, which contains multi-level depth-related questions to train VLMs in depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, on general VLM benchmarks, and on Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.