In recent years, domains such as natural language processing and image recognition have popularized the paradigm of pretraining representations on large datasets so that they can be effectively transferred to downstream tasks. In this work we evaluate how this paradigm should be applied in imitation learning, where both pretraining and finetuning data are trajectories collected by experts interacting with an unknown environment. Specifically, we consider a setting where the pretraining corpus consists of multitask demonstrations and the task for each demonstration is set by an unobserved latent context variable. The goal is to use the pretraining corpus to learn a low-dimensional representation of the high-dimensional (e.g., visual) observation space that can be transferred to a novel context for finetuning on a limited dataset of demonstrations. Among a variety of possible pretraining objectives, we argue that inverse dynamics modeling (i.e., predicting an action given the observations appearing before and after it in the demonstration) is well-suited to this setting. We provide empirical evidence for this claim through evaluations on a variety of simulated visuomotor manipulation problems. While previous work has offered various theoretical explanations of the benefit of inverse dynamics modeling, we find that these arguments are insufficient to explain the empirical advantages often observed in our settings, and so we derive a novel analysis using a simple but general environment model.
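To make the inverse dynamics objective concrete, here is a minimal sketch of how a demonstration might be turned into training examples and scored. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names, the squared-error loss, and the toy one-dimensional environment (next position equals position plus action, with a second "distractor" observation coordinate) are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Tuple[Vector, Vector], Vector]


def inverse_dynamics_pairs(observations: List[Vector],
                           actions: List[Vector]) -> List[Pair]:
    """Turn one demonstration into ((o_t, o_{t+1}), a_t) examples."""
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    return [((observations[t], observations[t + 1]), actions[t])
            for t in range(len(actions))]


def inverse_dynamics_loss(encode: Callable[[Vector], Vector],
                          predict: Callable[[Vector, Vector], Vector],
                          pairs: List[Pair]) -> float:
    """Mean squared error of predicting a_t from the encoded o_t and o_{t+1}.

    The squared-error loss is an assumption made for this sketch.
    """
    if not pairs:
        raise ValueError("need at least one training pair")
    total = 0.0
    for (obs, next_obs), action in pairs:
        predicted = predict(encode(obs), encode(next_obs))
        total += sum((p - a) ** 2 for p, a in zip(predicted, action))
    return total / len(pairs)


# Hypothetical toy demonstration: each observation is [position, distractor],
# and the environment moves the position by the action at every step.
observations = [[0.0, 5.0], [1.0, -3.0], [3.0, 7.0]]
actions = [[1.0], [2.0]]
pairs = inverse_dynamics_pairs(observations, actions)

# A low-dimensional representation that keeps only the position coordinate.
keep_position = lambda obs: [obs[0]]
# An action head that reads the action off as the change in representation.
difference = lambda z, z_next: [z_next[0] - z[0]]

loss = inverse_dynamics_loss(keep_position, difference, pairs)
```

In this toy case the representation that discards the distractor coordinate already predicts every action exactly, which is the kind of low-dimensional, action-relevant encoding the objective is meant to reward. In practice the encoder and action head would be learned models trained by gradient descent on the pretraining corpus.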