This paper addresses Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to issue commands but also allows the robot to navigate more challenging, cluttered scenes. However, translating human language instructions all the way down to low-level leg-joint actions is non-trivial. We propose NaVILA, a two-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of predicting low-level actions directly from the VLA, NaVILA first generates mid-level actions that express spatial information in language (e.g., "moving forward 75cm"), which serve as input to a visual locomotion RL policy for execution. NaVILA substantially improves over previous approaches on existing benchmarks. The same advantages hold on our newly developed IsaacLab benchmarks, which feature more realistic scenes and low-level control, and in real-world robot experiments. We show more results at https://navila-bot.github.io/
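To make the two-level interface concrete, the sketch below shows one way mid-level language actions such as "moving forward 75cm" could be parsed into structured commands for a downstream locomotion policy. This is a minimal illustration under assumed command formats; the function name, patterns, and command vocabulary are hypothetical and not the paper's actual interface.

```python
import re

def parse_midlevel_action(action: str):
    """Parse a mid-level language action into a (command, value) pair.

    Hypothetical parser for illustration only: the real NaVILA system's
    action format and downstream policy interface may differ.
    """
    patterns = {
        "move_forward": r"mov(?:e|ing) forward (\d+(?:\.\d+)?)\s*cm",
        "turn_left": r"turn(?:ing)? left (\d+(?:\.\d+)?)\s*degrees?",
        "turn_right": r"turn(?:ing)? right (\d+(?:\.\d+)?)\s*degrees?",
    }
    text = action.lower()
    for command, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            return command, float(match.group(1))
    # Fall back to a stop command when no pattern matches.
    return "stop", 0.0

print(parse_midlevel_action("moving forward 75cm"))   # -> ('move_forward', 75.0)
print(parse_midlevel_action("turn left 30 degrees"))  # -> ('turn_left', 30.0)
```

The parsed (command, value) pair would then be mapped to velocity or displacement targets tracked by the RL locomotion controller.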