Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths, augmented with additional information on room types, object locations, and the 3D shapes of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1,847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks, including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.