Smart autonomous agents are becoming increasingly important in various real-life applications, including robotics and autonomous vehicles. One crucial skill that these agents must possess is the ability to interact with surrounding entities, such as other agents or humans. In this work, we aim to build an intelligent agent that can efficiently navigate in an environment while interacting with an oracle (or human) in natural language, asking for directions when it is unsure about its navigation performance. The interaction is initiated by the agent, which produces a question that the oracle then answers on the basis of the shortest trajectory to the goal. This process can be repeated multiple times during navigation, enabling the agent to hold a dialogue with the oracle. To this end, we propose a novel computational model, named UNMuTe, that consists of two main components: a dialogue model and a navigator. Specifically, the dialogue model is based on a GPT-2 decoder that handles multimodal data consisting of both text and images. First, the dialogue model is trained to generate question-answer pairs: the question is generated using the current image, while the answer is produced by leveraging future images on the path toward the goal. Subsequently, a VLN model is trained to follow the dialogue, predicting navigation actions or triggering the dialogue model when it needs help. In our experimental analysis, we show that UNMuTe achieves state-of-the-art performance on the main navigation tasks involving dialogue, i.e., Cooperative Vision-and-Dialog Navigation (CVDN) and Navigation from Dialog History (NDH), demonstrating that our approach is effective in generating useful questions and answers to guide navigation.
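The interaction loop described above (an agent that navigates, asks the oracle a question when its confidence is low, and conditions later actions on the accumulated dialogue) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all class names, the confidence threshold, and the stub oracle that answers from the shortest path are hypothetical stand-ins for the GPT-2 dialogue model and VLN navigator.

```python
# Hypothetical sketch of a UNMuTe-style agent-oracle interaction loop.
# All names and the thresholding heuristic are illustrative assumptions,
# not the actual model described in the abstract.

class OracleStub:
    """Answers questions on the basis of the shortest trajectory to the goal."""

    def __init__(self, shortest_path):
        self.shortest_path = shortest_path

    def answer(self, question, step):
        # The real oracle leverages future images along the shortest path;
        # here we simply return the next few waypoints as a textual hint.
        hint = self.shortest_path[step:step + 3]
        return "head toward " + ", ".join(hint)


class NavigatorStub:
    """Navigator that triggers the dialogue model when it needs help."""

    def __init__(self, confidence_threshold=0.5):
        self.confidence_threshold = confidence_threshold
        self.dialogue = []  # accumulated question-answer pairs

    def step(self, observation, confidence, oracle, t):
        if confidence < self.confidence_threshold:
            # In UNMuTe, a GPT-2 decoder generates the question from the
            # current image; here a template stands in for it.
            question = f"Which way should I go from the {observation}?"
            answer = oracle.answer(question, t)
            self.dialogue.append((question, answer))
        # The VLN model would predict an action conditioned on self.dialogue;
        # a fixed action stands in for that prediction here.
        return "forward"


oracle = OracleStub(["hallway", "kitchen", "goal"])
agent = NavigatorStub()
action = agent.step("living room", confidence=0.2, oracle=oracle, t=0)
```

Because the agent's confidence (0.2) falls below the threshold, this step triggers a question, and the oracle's answer is stored in the dialogue history for subsequent action prediction.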