Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, but they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot, which improves spatial understanding by taking both RGB and depth images as input. Additionally, we construct the SpatialQA dataset, which contains multi-level depth-related questions for training VLMs in depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' spatial understanding capabilities at different levels. Extensive experiments on our spatial-understanding benchmark, on general VLM benchmarks, and on Embodied AI tasks demonstrate the remarkable improvements achieved by SpatialBot trained on SpatialQA. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.