In this paper, we explore few-shot imitation learning for control problems, which involves learning to imitate a target policy from a limited set of offline rollouts. This setting has been relatively under-explored despite its relevance to robotics and control applications. State-of-the-art methods for few-shot imitation rely on meta-learning, which is expensive to train because it requires access to a distribution over tasks (rollouts from many target policies and variations of the base environment). Given this limitation, we investigate an alternative approach, fine-tuning, a family of methods that pretrain on a single dataset and then fine-tune on unseen domain-specific data. Recent work has shown that fine-tuners outperform meta-learners in few-shot image classification tasks, especially when the data is out-of-domain. Here we evaluate to what extent this holds for control problems, proposing a simple yet effective baseline that relies on two stages: (i) training a base policy online via reinforcement learning (e.g. Soft Actor-Critic) on a single base environment, and (ii) fine-tuning the base policy via behavioral cloning on a few offline rollouts of the target policy. Despite its simplicity, this baseline is competitive with meta-learning methods under a variety of conditions and is able to imitate target policies trained on unseen variations of the original environment. Importantly, the proposed approach is practical and easy to implement, as it does not need any complex meta-training protocol. As a further contribution, we release an open-source dataset called iMuJoCo (iMitation MuJoCo), consisting of 154 variants of popular OpenAI-Gym MuJoCo environments with associated pretrained target policies and rollouts, which the community can use to study few-shot imitation learning and offline reinforcement learning.
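To make the second stage concrete, the following is a minimal sketch of behavioral-cloning fine-tuning, not the implementation used in this work. It assumes a toy linear policy, and it replaces stage (i) (online reinforcement learning such as Soft Actor-Critic) with a fixed, hypothetical weight matrix `base_weights`. The target policy `target_weights`, the state dimension, the learning rate, and all function names are likewise illustrative assumptions.

```python
import random


def policy_action(weights, state):
    """Toy linear deterministic policy: action = W @ state."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def bc_loss(weights, rollouts):
    """Mean squared error between policy actions and demonstrated actions."""
    total, count = 0.0, 0
    for state, action in rollouts:
        pred = policy_action(weights, state)
        total += sum((p - a) ** 2 for p, a in zip(pred, action))
        count += len(action)
    return total / count


def finetune_bc(base_weights, rollouts, lr=0.2, epochs=200):
    """Stage (ii): fine-tune a copy of the base policy by gradient descent on the BC loss."""
    weights = [row[:] for row in base_weights]  # leave the base policy untouched
    n = sum(len(action) for _, action in rollouts)
    for _ in range(epochs):
        grads = [[0.0] * len(row) for row in weights]
        for state, action in rollouts:
            pred = policy_action(weights, state)
            for i, (p, a) in enumerate(zip(pred, action)):
                for j, s in enumerate(state):
                    grads[i][j] += 2.0 * (p - a) * s / n
        for i, row in enumerate(weights):
            for j in range(len(row)):
                row[j] -= lr * grads[i][j]
    return weights


# Hypothetical stand-in for stage (i): weights of a policy pretrained on the base environment.
base_weights = [[1.0, 0.0, 0.5]]
# Hypothetical target policy, imagined as trained on a variant of the base environment.
target_weights = [[1.2, -0.3, 0.4]]

# A few offline (state, action) pairs from the target policy, standing in for rollouts.
rng = random.Random(0)
rollouts = []
for _ in range(20):
    state = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    rollouts.append((state, policy_action(target_weights, state)))

loss_before = bc_loss(base_weights, rollouts)
tuned_weights = finetune_bc(base_weights, rollouts)
loss_after = bc_loss(tuned_weights, rollouts)
print(f"BC loss before fine-tuning: {loss_before:.6f}")
print(f"BC loss after fine-tuning:  {loss_after:.6f}")
```

In a realistic setting the linear map would be replaced by the neural network policy obtained from stage (i), and the synthetic pairs by state-action pairs taken from the offline rollouts of the target policy. The structure of the loop (copy the base policy, minimize the imitation loss on a small dataset) would stay the same.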