Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks because their training data are drawn from general-purpose image datasets that often lack sophisticated spatial understanding. For example, these datasets rarely capture reference-frame comprehension, yet effective spatial reasoning requires knowing whether to reason from an ego-, world-, or object-centric perspective. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5k 3D scans, and 3M annotated spatial relationships, and the pairing of 2D egocentric images with 3D scans makes it both 2D- and 3D-ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
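To make the described pairing of 2D egocentric images, 3D scans, and reference-frame-aware spatial relationships concrete, the following is a minimal, hypothetical Python sketch of what one annotated record could look like; the field names, file paths, and values are illustrative assumptions, not the released RoboSpatial schema.

    # Hypothetical annotation record; field names and paths are illustrative only.
    annotation = {
        "image": "scenes/indoor_0421/egocentric_012.jpg",  # 2D egocentric view of the scene
        "scan": "scenes/indoor_0421/scan.ply",              # paired 3D scan of the same scene
        "question": "Is the mug to the left of the laptop?",
        "reference_frame": "ego",                           # ego-, world-, or object-centric
        "answer": True,
    }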