In offline imitation learning (IL), an agent aims to learn an optimal expert behavior policy without additional online environment interactions. However, in many real-world scenarios, such as robotic manipulation, the offline dataset is collected from suboptimal behaviors without rewards. Because expert data is scarce, agents often simply memorize poor trajectories and are vulnerable to variations in the environment, failing to generalize to new environments. To automatically generate high-quality expert data and improve the agent's generalization ability, we propose a framework named \underline{O}ffline \underline{I}mitation \underline{L}earning with \underline{C}ounterfactual data \underline{A}ugmentation (OILCA), which performs counterfactual inference. In particular, we leverage an identifiable variational autoencoder to generate \textit{counterfactual} samples for expert data augmentation. We theoretically analyze the influence of the generated expert data and the resulting improvement in generalization. Moreover, we conduct extensive experiments demonstrating that our approach significantly outperforms various baselines on both the \textsc{DeepMind Control Suite} benchmark for in-distribution performance and the \textsc{CausalWorld} benchmark for out-of-distribution generalization. Our code is available at \url{https://github.com/ZexuSun/OILCA-NeurIPS23}.