Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Creating high-quality data for training robust language-instructed agents is a long-standing challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality, large-scale navigational instruction-trajectory pairs by iteratively refining a data pool through the collaboration of two models, an instruction generator and a navigator, without any human-in-the-loop annotation. Specifically, SRDF starts by using a base generator to create an initial data pool for training a base navigator, and then applies the trained navigator to filter the data pool. The filtered, higher-fidelity data is used to train a better generator, which in turn produces higher-quality data for training the next-round navigator. This flywheel establishes a data self-refining process, yielding a continually improving and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator raises the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. The process also yields a superior generator, as evidenced by a SPICE increase from 23.5 to 26.2, which is better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method by increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, where it surpasses state-of-the-art methods by a large margin in all cases.
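The flywheel loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function `run_flywheel`, its parameters, the `Pair` type, the exact-match fidelity score, and the toy generator and navigator stubs are all hypothetical names introduced here for illustration, and the abstract does not specify the actual filtering criterion or training procedures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Trajectory = Tuple[int, ...]  # hypothetical: a path as a tuple of viewpoint ids
Generator = Callable[[Trajectory], str]
Navigator = Callable[[str], Trajectory]


@dataclass(frozen=True)
class Pair:
    instruction: str
    trajectory: Trajectory


def run_flywheel(
    trajectories: List[Trajectory],
    base_generator: Generator,
    train_navigator: Callable[[List[Pair]], Navigator],
    train_generator: Callable[[List[Pair]], Generator],
    fidelity: Callable[[Trajectory, Trajectory], float],
    rounds: int = 3,
    threshold: float = 1.0,  # assumed cutoff; the abstract gives no value
):
    generator: Generator = base_generator
    navigator: Optional[Navigator] = None
    kept_per_round: List[int] = []
    for _ in range(rounds):
        # 1. The current generator labels every trajectory, forming the data pool.
        pool = [Pair(generator(t), t) for t in trajectories]
        # 2. Train this round's navigator on the pool.
        navigator = train_navigator(pool)
        # 3. Keep only pairs whose instruction leads the navigator back
        #    to a path that scores at or above the threshold.
        filtered = [
            p for p in pool
            if fidelity(navigator(p.instruction), p.trajectory) >= threshold
        ]
        # 4. Train the next-round generator on the filtered data.
        generator = train_generator(filtered)
        kept_per_round.append(len(filtered))
    return generator, navigator, kept_per_round


# Toy stand-ins (assumptions, not real models) so the loop can be run end to end.
def noisy_generator(t: Trajectory) -> str:
    # Describes only short paths correctly; longer ones get a useless instruction.
    return " ".join(map(str, t)) if len(t) <= 2 else "unclear"


def parse_navigator(_pool: List[Pair]) -> Navigator:
    # "Training" is a stub: the navigator simply reads viewpoint ids from the text.
    def navigate(instruction: str) -> Trajectory:
        return tuple(int(tok) for tok in instruction.split() if tok.isdigit())
    return navigate


def faithful_generator(_filtered: List[Pair]) -> Generator:
    # "Training" is a stub: the refined generator describes every path exactly.
    return lambda t: " ".join(map(str, t))


def exact_match(predicted: Trajectory, reference: Trajectory) -> float:
    return 1.0 if predicted == reference else 0.0


paths = [(1, 2), (3, 4, 5), (6,)]
_, _, kept = run_flywheel(
    paths, noisy_generator, parse_navigator, faithful_generator, exact_match, rounds=2
)
print(kept)  # number of pairs that survive filtering in each round
```

In this toy run, the first round discards the pair with the uninformative instruction, and the second round keeps all pairs because the stubbed "refined" generator describes every path exactly. In a real system, the stubs would be replaced by learned models and a graded fidelity metric.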
