Consider learning an imitation policy from behavior demonstrated in multiple environments, with an eye towards deployment in an unseen environment. Since the observable features from each setting may be different, directly learning individual policies as mappings from features to actions is prone to spurious correlations and may not generalize well. However, the expert's policy is often a function of a shared latent structure underlying those observable features that is invariant across settings. By leveraging data from multiple environments, we propose Invariant Causal Imitation Learning (ICIL), a novel technique in which we learn a feature representation that is invariant across domains, on the basis of which we learn an imitation policy that matches expert behavior. To cope with transition dynamics mismatch, ICIL learns a shared representation of causal features (for all training environments) that is disentangled from the specific representations of noise variables (for each of those environments). Moreover, to ensure that the learned policy matches the observation distribution of the expert's policy, ICIL estimates the energy of the expert's observations and uses a regularization term that minimizes the energy of the imitator policy's next state. Experimentally, we compare our method against several benchmarks in control and healthcare tasks and show its effectiveness in learning imitation policies capable of generalizing to unseen environments.
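To make the energy-regularized objective concrete, the following is a minimal sketch of how an imitation loss and a next-state energy penalty could be combined. It is not the authors' implementation: the abstract does not specify the energy model, the dynamics model, or the weighting, so `quadratic_energy`, `predict_next`, `policy_logits`, and `reg_weight` are all hypothetical stand-ins chosen for illustration. The representation passed in is assumed to be the shared, environment-invariant one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, action):
    """Negative log-likelihood of the expert's action under the policy."""
    return -math.log(softmax(logits)[action])


def quadratic_energy(state, expert_mean):
    """Hypothetical stand-in energy: squared distance to the mean expert
    observation. Low energy means the state looks like expert data."""
    return sum((s - m) ** 2 for s, m in zip(state, expert_mean))


def icil_style_loss(batch, policy_logits, predict_next, energy, reg_weight=0.1):
    """Average imitation loss plus a weighted next-state energy penalty.

    batch         : list of (representation, expert_action) pairs
    policy_logits : maps a representation to action logits (assumed)
    predict_next  : maps (representation, action) to a next state (assumed)
    energy        : maps a state to a scalar energy (assumed)
    """
    imitation, penalty = 0.0, 0.0
    for rep, expert_action in batch:
        logits = policy_logits(rep)
        imitation += cross_entropy(logits, expert_action)
        # Greedy action of the imitator; ties resolve to the lowest index.
        imitator_action = max(range(len(logits)), key=logits.__getitem__)
        penalty += energy(predict_next(rep, imitator_action))
    n = len(batch)
    return imitation / n + reg_weight * penalty / n


# Toy usage with hand-written stand-ins for the learned components.
toy_batch = [([0.0, 0.0], 0)]
toy_loss = icil_style_loss(
    toy_batch,
    policy_logits=lambda rep: [0.0, 0.0],
    predict_next=lambda rep, action: [1.0, 0.0],
    energy=lambda state: quadratic_energy(state, [0.0, 0.0]),
)
```

In this toy setup the policy is uniform over two actions and the predicted next state sits at unit squared distance from the expert mean, so the loss is the sum of a cross-entropy term and a scaled energy term. A real system would replace each stand-in with a learned network and would also need the invariance and disentanglement terms described above, which this sketch omits.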