Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models: 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action prediction as multimodal text generation. By unifying all tasks as text generation, EMMA learns a language of actions, which facilitates transfer across tasks. Unlike previous modular approaches with independently trained components, we use a single multitask model in which each task contributes to goal completion. EMMA performs on par with similar models on several VL benchmarks and sets a new state of the art (36.81% success rate) on Dialog-guided Task Completion (DTC), a benchmark for evaluating dialog-guided agents in the Alexa Arena.