Recent research in language-guided visual navigation has shown that training generalizable agents demands both diverse traversable environments and large quantities of supervision. To tackle the data scarcity common to existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale training data, which uses 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web. Importantly, we investigate how each component of this paradigm influences the agent's performance and study how to best apply the augmented data to pre-train and fine-tune an agent. With our large-scale dataset, simple imitation learning pushes the performance of an existing agent to a new best of 80% single-run success rate on the R2R test split (+11% absolute over the previous SoTA). The long-standing generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% for the previous best method). Moreover, our paradigm enables different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.