The ability to assist humans in a supportive role during a navigation task is crucial for intelligent agents. Such agents, equipped with environment knowledge and conversational abilities, can guide individuals through unfamiliar terrain by generating natural language responses to their inquiries, grounded in the visual information of their surroundings. However, these multimodal conversational navigation helpers are still underdeveloped. This paper proposes a new benchmark, Respond to Help (R2H), for building multimodal navigation helpers that can respond to requests for help, based on existing dialog-based embodied datasets. R2H comprises two main tasks: (1) Respond to Dialog History (RDH), which assesses the helper agent's ability to generate informative responses based on a given dialog history, and (2) Respond during Interaction (RdI), which evaluates the helper agent's ability to maintain effective and consistent cooperation with a task performer agent in real time during navigation. Furthermore, we propose SeeRee, a novel task-oriented multimodal response generation model that can see and respond, as the navigation helper that guides the task performer in embodied tasks. Through both automatic and human evaluations, we show that SeeRee produces more effective and informative responses than baseline methods when assisting the task performer with different navigation tasks. Project website: https://sites.google.com/view/respond2help/home.