Recent research in language-guided visual navigation has shown that training generalizable agents demands both diverse traversable environments and large quantities of supervision. To address the data scarcity common to existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale training data, which uses 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web. Importantly, we investigate how each component of this paradigm influences the agent's performance and study how to effectively apply the augmented data to pre-train and fine-tune an agent. With our large-scale dataset, simple imitation learning pushes the performance of an existing agent to a new best of 80% single-run success rate on the R2R test split (+11% absolute over the previous SoTA). The long-standing generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% for the previous best method). Moreover, our paradigm enables different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.