We study generalizable policy learning from demonstrations for complex low-level control tasks (e.g., contact-rich object manipulation). We propose an imitation learning method that incorporates temporal abstraction and the planning capabilities of hierarchical RL (HRL) in a novel and effective manner. As a step towards decision foundation models, our design can utilize scalable, albeit highly sub-optimal, demonstrations. Specifically, we find that certain short subsequences of the demos, i.e., the chain-of-thought (CoT), reflect the demos' hierarchical structure by marking the completion of subgoals in the tasks. Our model learns to dynamically predict the entire CoT as coherent and structured long-term action guidance, and it consistently outperforms typical two-stage subgoal-conditioned policies. Moreover, the CoT facilitates generalizable policy learning because it exemplifies the decision patterns shared among demos (even those with heavy noise and randomness). Our method, Chain-of-Thought Predictive Control (CoTPC), significantly outperforms existing methods on challenging low-level manipulation tasks when learning from scalable yet highly sub-optimal demos.