Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://yanweiw.github.io/glide