Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems. Recent Vision-Language-Action (VLA) models have demonstrated strong navigation performance, but their high computational cost introduces latency that limits real-time deployment. We propose a training-free spatio-temporal vision token pruning framework tailored to VLA-based VLN. It applies spatial token selection to the current view and spatio-temporal compression to historical memories, reducing redundant computation and enabling efficient long-horizon inference. By leveraging attention-based token importance and query-guided spatio-temporal filtering, the proposed approach preserves navigation-relevant information without retraining or modifying pretrained models, allowing plug-and-play integration into existing VLA systems. Experiments on standard VLN benchmarks show that our method significantly outperforms existing pruning strategies, retaining high navigation accuracy even under extreme pruning ratios while maintaining competitive inference efficiency. Real-world deployment on a Unitree Go2 quadruped robot further validates reliable, low-latency instruction-following navigation under practical robotic constraints. We hope this work helps bridge the gap between large-scale multimodal modeling and efficient, real-time embodied deployment in robotic navigation systems.
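To make the pruning idea concrete, the following is a minimal sketch of attention-based vision token selection, the core operation the abstract describes: rank vision tokens by their attention mass toward the instruction query and keep only the top fraction. The function name, tensor shapes, and PyTorch dependency are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def prune_vision_tokens(vision_tokens, attn_to_query, keep_ratio=0.25):
    """Keep the top-k vision tokens ranked by attention to the instruction.

    Hypothetical helper; shapes assumed for illustration:
      vision_tokens: (N, D) vision token embeddings
      attn_to_query: (N,)   per-token attention mass toward the query
    """
    n_keep = max(1, int(vision_tokens.size(0) * keep_ratio))
    # Rank tokens by attention-based importance and keep the top n_keep.
    top_idx = torch.topk(attn_to_query, n_keep).indices
    # Restore the original spatial order of the surviving tokens.
    top_idx, _ = torch.sort(top_idx)
    return vision_tokens[top_idx], top_idx

# Toy usage: 196 patch tokens of dim 768 with random importance scores.
tokens = torch.randn(196, 768)
scores = torch.rand(196)
kept, idx = prune_vision_tokens(tokens, scores, keep_ratio=0.25)
print(kept.shape)  # torch.Size([49, 768])
```

In a training-free setting such as the one described, the importance scores would come from attention maps already computed inside the frozen VLA model, so no parameters are updated and the pretrained weights are left untouched.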