Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent large vision-language models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models on VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses the decisions of the two models. For obstacle avoidance, we replace the rule-based controller with a fully learnable point-goal policy in simulation, and for real-world deployment we design a LiDAR-based clustering module that generates navigable waypoints, paired with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results (ranking 1st) on the VLN-CE leaderboard, significantly improving success rate (SR) and success weighted by path length (SPL) on the test-unseen split over previous SoTA methods. Real-world experiments further demonstrate CLASH's robustness, validating its effectiveness in both simulation and deployment.