Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands, enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant only for as long as it fits within the context size of the LLM and can be forgotten over longer interactions. In this work, we investigate fine-tuning robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by the average number of corrections needed before the user considers the task successful). Our key observation is that when human-robot interactions are formulated as a partially observable Markov decision process (in which human language inputs are observations and robot code outputs are actions), then training an LLM to complete previous interactions amounts to training a transition dynamics model, which can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments, improving non-expert teaching success rates on unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning of new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
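The POMDP-plus-MPC framing above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: in LMPC the dynamics model is the fine-tuned LLM predicting the human's next utterance given the interaction so far, whereas here `predict_feedback` is a toy stand-in, and all function and variable names are illustrative.

```python
def predict_feedback(history, code):
    """Toy stand-in for the learned transition dynamics model: given the
    interaction history (observations) and a candidate robot-code action,
    predict the human's next language input."""
    if "red_block" in code:
        return "success"          # predicted to satisfy the user
    return "no, the red one"      # predicted correction

def rollout_cost(history, code, max_steps=3):
    """Roll the dynamics model forward, counting predicted human corrections
    until predicted success; shorter paths to success score lower."""
    corrections = 0
    for _ in range(max_steps):
        feedback = predict_feedback(history, code)
        if feedback == "success":
            return corrections
        history = history + [feedback]
        corrections += 1
    return corrections

def lmpc_select(history, candidate_codes):
    """MPC-style action selection: pick the candidate robot-code action whose
    predicted rollout reaches success in the fewest corrections."""
    return min(candidate_codes, key=lambda c: rollout_cost(history, c))

history = ["user: pick up the red block"]
candidates = ["robot.pick('blue_block')", "robot.pick('red_block')"]
best = lmpc_select(history, candidates)
# best == "robot.pick('red_block')"  (predicted cost 0 vs. 3)
```

The design point the sketch makes is the one in the abstract: once language feedback is treated as an observation and generated code as an action, a model trained to complete past interactions doubles as a simulator, so candidate actions can be scored by predicted corrections-to-success rather than executed blindly.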