The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Although there have been significant advancements in related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents with the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. To address this gap, we propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask so as to dynamically acquire additional information for completing the task. We evaluate ELBA on the TEACh vision-dialog navigation and task completion dataset. Experimental results show that the proposed method achieves improved task performance compared to baseline models without question-answering capabilities.