Grounding the common-sense reasoning of Large Language Models in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior work has focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search for task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from the manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://sites.google.com/view/grounding-plans