We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language and provide real-time binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach that converts user feedback into immediate reward. We evaluate through thousands of human-agent interactions, demonstrating a 15.4% absolute improvement in instruction execution accuracy over time. We also show that our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.
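To make the idea of converting binary feedback into an immediate reward for contextual bandit learning concrete, the following is a minimal sketch and not the method as implemented in this work. The abstract does not specify the reward values, the policy class, or the update rule, so everything here is an assumption for illustration: the +1/-1 reward mapping, the linear softmax policy, the policy-gradient update, and the names `feedback_to_reward` and `LinearSoftmaxPolicy` are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feedback_to_reward(feedback):
    """Map binary user feedback to an immediate reward.

    Assumption: positive feedback -> +1, negative feedback -> -1.
    """
    return 1.0 if feedback else -1.0


class LinearSoftmaxPolicy:
    """Hypothetical policy: one weight vector per action, softmax over scores."""

    def __init__(self, n_actions, n_features, lr=0.1, seed=0):
        self.weights = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr
        self.rng = random.Random(seed)

    def probs(self, context):
        scores = [sum(w * x for w, x in zip(row, context)) for row in self.weights]
        return softmax(scores)

    def act(self, context):
        # Sample an action from the current policy distribution.
        p = self.probs(context)
        return self.rng.choices(range(len(p)), weights=p, k=1)[0]

    def update(self, context, action, reward):
        # Single-step policy-gradient update for one bandit example:
        # d/dW[k] log pi(action | context) = (1[k == action] - pi(k | context)) * context
        p = self.probs(context)
        for k, row in enumerate(self.weights):
            coeff = (1.0 if k == action else 0.0) - p[k]
            for j, x in enumerate(context):
                row[j] += self.lr * reward * coeff * x


if __name__ == "__main__":
    # Simulated user: approves the action whose index matches the context.
    policy = LinearSoftmaxPolicy(n_actions=2, n_features=2, lr=0.5)
    contexts = [[1.0, 0.0], [0.0, 1.0]]
    for step in range(200):
        ctx_id = step % 2
        action = policy.act(contexts[ctx_id])
        reward = feedback_to_reward(action == ctx_id)
        policy.update(contexts[ctx_id], action, reward)
    print(policy.probs(contexts[0]), policy.probs(contexts[1]))
```

The key property of the bandit setting shown here is that each update uses only the reward observed for the action actually taken, with no credit assignment across later steps. A deployed agent would presumably replace the linear policy with a learned instruction-following model and would have to handle delayed or missing feedback, neither of which this sketch addresses.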