Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they must understand routes described from human memory, which frequently contain stutters, errors, and omitted details, unlike instructions obtained by thinking aloud, as in the R2R dataset. However, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present Memory-Maze, a benchmark that simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data collected from human memory. Our analysis demonstrates that instructions collected from memory are longer and contain more varied wording. We further demonstrate that handling the errors and ambiguities of memory-based instructions is challenging, by evaluating state-of-the-art models alongside our baseline model with modularized perception and control.