Research on learning instruction-guided visual navigation can be broadly divided into high-level category-specific search and low-level language-guided navigation, depending on the granularity of the language instruction: the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework -- we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation, and propose a novel State-Adaptive Mixture of Experts (SAME) model that enables an agent to infer decisions from language of varying granularity together with dynamic observations. Powered by SAME, we present a versatile agent that addresses seven navigation tasks simultaneously and outperforms, or achieves performance highly comparable to, task-specific agents.
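To make the mixture-of-experts idea concrete, the following is a minimal, generic sketch of state-conditioned expert routing: a gating network scores a set of experts from the agent's current state (e.g. fused instruction and observation features), and the layer output is the gate-weighted mixture of expert outputs. This is an illustrative NumPy toy, not the paper's actual SAME implementation; the class name, the use of linear experts, and the single dense gate are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class StateAdaptiveMoESketch:
    """Toy state-adaptive MoE layer (hypothetical, for illustration).

    Each expert is a linear map; a gating network produces a softmax
    distribution over experts from the state vector, so routing adapts
    to the agent's current language/observation state.
    """

    def __init__(self, dim, num_experts):
        self.experts = [rng.standard_normal((dim, dim)) * 0.1
                        for _ in range(num_experts)]
        self.gate = rng.standard_normal((dim, num_experts)) * 0.1

    def __call__(self, state):
        weights = softmax(state @ self.gate)                 # (num_experts,)
        outputs = np.stack([state @ W for W in self.experts])  # (E, dim)
        return weights @ outputs, weights                    # mix, gate weights


moe = StateAdaptiveMoESketch(dim=8, num_experts=4)
state = rng.standard_normal(8)   # stand-in for a fused state feature
out, gate_weights = moe(state)
```

Because the gate is a function of the state, two different states (say, a coarse "find the kitchen" goal versus a detailed step-by-step instruction) would activate different expert mixtures, which is the intuition behind sharing one model across tasks of differing instruction granularity.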