Embodied navigation falls into two major categories: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions, and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts the input modalities (RGB, depth, and instructions) in realistic scenarios and evaluates the impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies for enhancing robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav; we deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
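To make the idea of corrupting each input modality concrete, the following is a minimal sketch in plain Python. It is not NavTrust's implementation: the function names, the choice of corruptions (additive Gaussian noise on RGB, randomly missing depth readings, randomly dropped instruction words), and the parameters `sigma`, `rate`, and `seed` are all assumptions made for illustration only.

```python
import random


def add_rgb_noise(rgb, sigma, seed=0):
    """Add Gaussian noise to an RGB image given as rows of (r, g, b) tuples.

    Hypothetical helper: values are clamped to the 0..255 range.
    """
    rng = random.Random(seed)

    def noisy(channel):
        return min(255, max(0, round(channel + rng.gauss(0.0, sigma))))

    return [[tuple(noisy(c) for c in pixel) for pixel in row] for row in rgb]


def drop_depth_pixels(depth, rate, seed=0):
    """Set a random fraction of depth readings to 0.0 to mimic missing returns.

    Hypothetical helper: `depth` is a 2D list of floats, `rate` is in [0, 1].
    """
    rng = random.Random(seed)
    return [[0.0 if rng.random() < rate else d for d in row] for row in depth]


def drop_instruction_words(instruction, rate, seed=0):
    """Randomly drop words from an instruction, always keeping at least one.

    Hypothetical helper: a crude stand-in for an instruction variation.
    """
    rng = random.Random(seed)
    words = instruction.split()
    kept = [w for w in words if rng.random() >= rate]
    # Fall back to the first word so the agent never receives an empty string.
    return " ".join(kept if kept else words[:1])


if __name__ == "__main__":
    rgb = [[(10, 20, 30), (200, 210, 220)], [(0, 0, 0), (255, 255, 255)]]
    depth = [[1.5, 2.0], [0.8, 3.2]]
    instruction = "walk past the sofa and stop at the kitchen door"

    print(add_rgb_noise(rgb, sigma=15.0))
    print(drop_depth_pixels(depth, rate=0.5))
    print(drop_instruction_words(instruction, rate=0.3))
```

Seeding each corruption keeps a corrupted episode reproducible, so the same perturbed inputs can be fed to every agent under comparison; the severity parameters (`sigma`, `rate`) are the knobs one would sweep to trace how performance degrades.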