We introduce Language Feedback Models (LFMs) that identify desirable behaviour, that is, actions that help achieve tasks specified in the instruction, for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve task-completion rates over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld). Second, when controlling for the number of LLM output tokens, LFMs outperform using LLMs as experts to directly predict actions. Third, LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation. Finally, LFMs can be modified to provide human-interpretable feedback without performance loss, allowing human verification of desirable behaviour for imitation learning.
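As a concrete illustration of the two-stage procedure the abstract summarizes, below is a minimal Python sketch. The callables `rollout`, `verbalize`, `llm_label`, `lfm_is_desirable`, and `fit_policy` are hypothetical placeholders, not names from the paper; the sketch only shows the shape of the pipeline: LLM feedback on verbalized trajectories trains an LFM, which then filters new rollouts for behavioural cloning.

```python
# Hedged sketch of the LFM pipeline described above. Every helper passed in
# as an argument is a hypothetical stand-in for a component the paper does
# not name; this is an illustration of the idea, not the authors' code.

from typing import Callable, List, Set, Tuple

Step = Tuple[str, str]                 # (observation text, action text)
Trajectory = Tuple[str, List[Step]]    # (instruction, steps)


def collect_lfm_training_data(
    rollout: Callable[[], Trajectory],
    verbalize: Callable[[Trajectory], str],   # visual trajectory -> language description
    llm_label: Callable[[str], Set[int]],     # LLM marks indices of desirable steps
    num_episodes: int,
) -> List[Tuple[str, str, str, bool]]:
    """Stage 1: query the LLM on verbalized trajectories to build LFM training data."""
    examples = []
    for _ in range(num_episodes):
        traj = rollout()
        instruction, steps = traj
        desirable = llm_label(verbalize(traj))
        for t, (obs, act) in enumerate(steps):
            examples.append((instruction, obs, act, t in desirable))
    return examples


def lfm_imitation_round(
    rollout: Callable[[], Trajectory],
    lfm_is_desirable: Callable[[str, str, str], bool],  # trained LFM, no further LLM calls
    fit_policy: Callable[[List[Tuple[str, str, str]]], None],
    num_episodes: int,
) -> None:
    """Stage 2: keep only steps the LFM marks desirable, then behaviour-clone on them."""
    demos = []
    for _ in range(num_episodes):
        instruction, steps = rollout()
        demos.extend(
            (instruction, obs, act)
            for obs, act in steps
            if lfm_is_desirable(instruction, obs, act)
        )
    fit_policy(demos)
```

One design point the abstract emphasizes is captured here: the expensive LLM is only consulted in stage 1 to train the LFM, while stage 2 (and any adaptation round in a new environment) relies on the cheaper learned feedback model.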