With the recent rise of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given only natural language as input. One such application area is indoor navigation from natural language instructions. Despite recent progress, however, this problem remains challenging because of the spatial reasoning and semantic understanding it requires, particularly in arbitrary scenes that may contain many objects belonging to fine-grained classes. To address this challenge, we curate the largest real-world dataset for Vision and Language-guided Action in 3D Scenes (VLA-3D), consisting of over 11.5K scanned 3D indoor rooms from existing datasets, 23.5M heuristically generated semantic relations between objects, and 9.7M synthetically generated referential statements. The dataset comprises processed 3D point clouds, semantic object and room annotations, scene graphs, navigable free-space annotations, and referential language statements that specifically focus on view-independent spatial relations for disambiguating objects. These features are designed to aid the downstream task of navigation, especially on real-world systems, where some level of robustness must be guaranteed in an open world of changing scenes and imperfect language. We benchmark our dataset with current state-of-the-art models to establish a performance baseline. All code to generate and visualize the dataset is publicly released at https://github.com/HaochenZ11/VLA-3D. With this dataset, we hope to provide a resource for semantic 3D scene understanding that is robust to scene changes, and to aid the development of interactive indoor navigation systems.