One-shot imitation aims to learn a new task from a single demonstration, yet applying it to complex tasks remains challenging because of the high domain diversity inherent in non-stationary environments. To tackle this problem, we explore the compositionality of complex tasks and present a novel skill-based imitation learning framework that enables one-shot imitation and zero-shot adaptation: from a single demonstration of a complex unseen task, a semantic skill sequence is inferred, and each skill in the sequence is then converted into an action sequence optimized for hidden environment dynamics that can vary over time. Specifically, we leverage a vision-language model to learn a semantic skill set from offline video datasets, where each skill is represented in the vision-language embedding space, and adapt meta-learning with dynamics inference to enable zero-shot skill adaptation. We evaluate our framework on various one-shot imitation scenarios for extended multi-stage Meta-world tasks, showing its superiority over other baselines in learning complex tasks, generalizing to dynamics changes, and extending to different demonstration conditions and modalities.
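As a rough illustration of the skill-inference step described above, the following sketch labels each demonstration frame with its most similar skill in a shared embedding space and merges consecutive repeats into a semantic skill sequence. It is a minimal toy under stated assumptions, not the framework's actual procedure: the function names `cosine` and `infer_skill_sequence`, the skill names, and the three-dimensional embeddings are all hypothetical, and nearest-neighbor matching by cosine similarity is a stand-in for whatever inference the vision-language model performs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def infer_skill_sequence(frame_embeddings, skill_embeddings):
    """Map per-frame embeddings to a sequence of skill names.

    frame_embeddings: list of vectors, one per demonstration frame
                      (assumed to live in the same space as the skills).
    skill_embeddings: dict mapping a skill name to its embedding vector.
    """
    sequence = []
    for frame in frame_embeddings:
        # Pick the skill whose embedding is closest to this frame.
        best = max(skill_embeddings,
                   key=lambda name: cosine(frame, skill_embeddings[name]))
        # Merge consecutive frames that map to the same skill.
        if not sequence or sequence[-1] != best:
            sequence.append(best)
    return sequence


# Hypothetical skill set and a four-frame toy demonstration.
skills = {
    "open drawer": [1.0, 0.0, 0.0],
    "push button": [0.0, 1.0, 0.0],
    "close door": [0.0, 0.0, 1.0],
}
demo_frames = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.9],
]

print(infer_skill_sequence(demo_frames, skills))
# ['open drawer', 'push button', 'close door']
```

In the framework described above, each inferred skill would then be converted into an action sequence conditioned on the inferred environment dynamics; that step is not sketched here because the abstract does not specify its form.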