Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by the average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are formulated as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions can be viewed as training a transition dynamics model -- which can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates on unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning of new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
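The MPC view described above can be sketched as a sample-based rollout loop. In the following minimal Python sketch, `dynamics_model` is a toy stand-in for the fine-tuned LLM (which, in LMPC, predicts the human's next feedback and task success from the interaction history); the function names, the candidate code strings, and the success probabilities are illustrative assumptions, not the paper's actual implementation.

```python
import random

random.seed(0)  # make the toy rollouts reproducible

# Toy stand-in for the fine-tuned LLM dynamics model: given the interaction
# history and a candidate robot-code action, predict the human's next
# feedback and whether the task is judged successful. (Illustrative only.)
def dynamics_model(history, code_action):
    done = "grasp" in code_action and random.random() < 0.8
    feedback = "looks good" if done else "move closer to the object"
    return feedback, done

def rollout_cost(history, code_action, horizon=5, samples=8):
    """Estimate the expected number of human corrections before success."""
    total = 0
    for _ in range(samples):
        h = list(history) + [code_action]
        steps = 0
        for _ in range(horizon):
            feedback, done = dynamics_model(h, code_action)
            if done:
                break
            steps += 1          # one more human correction was needed
            h.append(feedback)  # simulated feedback extends the history
        total += steps
    return total / samples

def lmpc_select(history, candidate_actions):
    """MPC step: pick the code action whose simulated rollouts reach
    success with the fewest expected corrections."""
    return min(candidate_actions, key=lambda a: rollout_cost(history, a))

history = ["user: pick up the red block"]
candidates = ["robot.move_to(block)", "robot.grasp(block)"]
best = lmpc_select(history, candidates)
```

The structure mirrors the abstract's framing: human inputs are observations appended to the history, robot code outputs are actions, and MPC scores each candidate action by rolling the learned dynamics forward to find the shortest path to success.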