The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Despite recent progress on related vision-language benchmarks, most prior work has focused on building agents that follow instructions, rather than on endowing agents with the ability to ask questions that actively resolve the ambiguities arising naturally in embodied environments. To give embodied agents the ability to interact with humans, we propose an Embodied Learning-By-Asking (ELBA) model that learns when to ask questions and what to ask, so that it can dynamically acquire the additional information needed to complete a task. We evaluate our model on the TEACH vision-dialog navigation and task completion dataset. Experimental results show that ELBA achieves improved task performance compared to baseline models without question-answering capabilities.