Vision-and-Language Navigation (VLN) has gained increasing attention in recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped both the challenges and the proposed methods in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes current methods and future opportunities that leverage foundation models to address VLN challenges. We hope our in-depth discussions provide valuable resources and insights: on one hand, to mark the progress made so far and explore the opportunities and potential roles for foundation models in this field; on the other, to organize the different challenges and solutions in VLN for foundation model researchers.