Human intelligence is remarkably adaptable, allowing us to adjust swiftly to new tasks and multi-modal environments. This ability is evident from a young age, as we acquire new skills and solve problems by imitating others or following natural language instructions. The research community is actively pursuing the development of interactive "embodied agents" that can converse naturally with humans and assist them with real-world tasks. Such agents must be able to request feedback promptly when communication breaks down or instructions are unclear. They must also be able to learn new vocabulary specific to a given domain. In this paper, we make the following contributions: (1) a crowd-sourcing tool for collecting grounded language instructions; (2) the largest dataset of grounded language instructions; and (3) several state-of-the-art baselines. These contributions provide a foundation for further research.