In this paper, we study offline-to-online Imitation Learning (IL), which pretrains an imitation policy from static demonstration data and then finetunes it rapidly with minimal environmental interaction. We find that the na\"ive combination of existing offline IL and online IL methods tends to perform poorly in this setting, because the initial discriminator (often used in online IL) is randomly initialized and thus discordant with the pretrained policy, leading to misguided policy optimization and $\textit{unlearning}$ of pretraining knowledge. To overcome this challenge, we propose a principled offline-to-online IL method, named $\texttt{OLLIE}$, that simultaneously learns a near-expert policy initialization and an $\textit{aligned discriminator initialization}$, which can be seamlessly integrated into online IL, achieving smooth and fast finetuning. Empirically, $\texttt{OLLIE}$ consistently and significantly outperforms the baseline methods across $\textbf{20}$ challenging tasks, ranging from continuous control to vision-based domains, in terms of performance, demonstration efficiency, and convergence speed. This work may serve as a foundation for further exploration of pretraining and finetuning in the context of IL.
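To make the role of the discriminator initialization concrete, below is a minimal sketch, in PyTorch, of the GAIL-style adversarial finetuning loop that online IL methods typically employ. It is an illustration under standard GAIL assumptions, not $\texttt{OLLIE}$'s implementation: the dimensions are hypothetical, and the networks are random stand-ins for the pretrained policy and the aligned discriminator that $\texttt{OLLIE}$ would supply from the offline stage.

```python
# Minimal GAIL-style finetuning sketch (illustrative, not OLLIE's code).
# OLLIE's point: both `policy` and `discriminator` below should arrive from
# the offline stage already aligned; a randomly initialized discriminator
# emits arbitrary rewards that can unlearn the pretrained policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 8, 2  # hypothetical dimensions

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

policy = mlp(STATE_DIM, ACTION_DIM)             # stand-in for the pretrained policy
discriminator = mlp(STATE_DIM + ACTION_DIM, 1)  # stand-in for the aligned init
d_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

def gail_reward(states, actions):
    """Adversarial reward r(s, a) = -log(1 - D(s, a)), with D = sigmoid(logits)."""
    with torch.no_grad():
        logits = discriminator(torch.cat([states, actions], dim=-1))
    return -F.logsigmoid(-logits)  # equals -log(1 - sigmoid(logits))

def discriminator_step(expert_sa, policy_sa):
    """One binary-classification update: expert pairs -> 1, policy pairs -> 0."""
    exp_logits = discriminator(expert_sa)
    pol_logits = discriminator(policy_sa)
    loss = (F.binary_cross_entropy_with_logits(exp_logits, torch.ones_like(exp_logits))
            + F.binary_cross_entropy_with_logits(pol_logits, torch.zeros_like(pol_logits)))
    d_opt.zero_grad()
    loss.backward()
    d_opt.step()
    return loss.item()

# Toy usage with random batches standing in for demonstrations and rollouts;
# in practice, the policy is then updated on gail_reward with any RL algorithm (e.g., PPO).
expert_sa = torch.randn(32, STATE_DIM + ACTION_DIM)
policy_sa = torch.randn(32, STATE_DIM + ACTION_DIM)
print("D loss:", discriminator_step(expert_sa, policy_sa))
print("reward batch shape:", gail_reward(policy_sa[:, :STATE_DIM], policy_sa[:, STATE_DIM:]).shape)
```

If `discriminator` starts from random weights while `policy` starts near-expert, `gail_reward` is initially uncorrelated with expert behavior, which is the misalignment the abstract describes; initializing both from the offline stage keeps the early reward signal consistent with the pretrained policy.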