Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.
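To make the masking setup and the "masked word recovery" measure concrete, here is a minimal text-level sketch. It is an illustration under stated assumptions, not the procedure used in this work: the actual masking is applied to synthesized speech, whereas this sketch masks word tokens directly. The names `mask_words`, `recovery_rate`, and the `<mask>` token are hypothetical, and the transcript is assumed to be position-aligned with the reference, which real ASR output generally is not without an alignment step.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def mask_words(words, mask_rate, seed=0):
    """Replace a fixed fraction of words with MASK.

    Returns the masked word list and the sorted indices that were masked.
    Assumption: masking is uniform at random over word positions.
    """
    rng = random.Random(seed)
    n_mask = round(mask_rate * len(words))
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    masked = [MASK if i in masked_idx else w for i, w in enumerate(words)]
    return masked, sorted(masked_idx)


def recovery_rate(reference, hypothesis, masked_idx):
    """Fraction of masked positions where the hypothesis matches the reference.

    Assumption: hypothesis and reference are aligned word for word.
    """
    if not masked_idx:
        return 0.0
    hits = sum(
        1
        for i in masked_idx
        if i < len(hypothesis) and hypothesis[i] == reference[i]
    )
    return hits / len(masked_idx)


instruction = "put the clean mug on the table".split()
masked, idx = mask_words(instruction, mask_rate=0.4)

# A transcript that restores every masked word scores 1.0;
# one that leaves the masks untouched scores 0.0.
perfect = recovery_rate(instruction, instruction, idx)
untouched = recovery_rate(instruction, masked, idx)
```

Comparing this rate between a unimodal and a multimodal transcript of the same masked instruction is one way to express the kind of gain reported above.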