Adversarial imitation learning has become a widely used imitation learning framework. The discriminator is typically trained with expert demonstrations and policy trajectories as examples of two classes (positive vs. negative), and the policy is then expected to produce trajectories that are indistinguishable from the expert demonstrations. In the real world, however, collected expert demonstrations are more likely to be imperfect, with only an unknown fraction of them being optimal. Instead of treating imperfect expert demonstrations as absolutely positive or negative, we treat them as unlabeled data. We develop a positive-unlabeled adversarial imitation learning algorithm that dynamically samples the expert demonstrations that best match the trajectories of the continually optimized agent policy. The trajectories of an initial agent policy may be closer to the non-optimal expert demonstrations, but within the adversarial imitation learning framework, the agent policy is optimized to fool the discriminator and thus to produce trajectories similar to the optimal expert demonstrations. Theoretical analysis shows that our method learns from the imperfect demonstrations in a self-paced manner. Experimental results on the MuJoCo and RoboSuite platforms demonstrate the effectiveness of our method from different aspects.
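To make the positive-unlabeled idea concrete, the following is a minimal sketch of a discriminator objective built on the standard non-negative PU risk estimator with a logistic loss. It is not the exact objective of the method summarized above, which the abstract does not spell out. The function names `softplus` and `nn_pu_risk`, the class prior `prior`, and the choice to treat the agent's trajectories as the labeled class and the expert demonstrations as the unlabeled mixture are all assumptions made for illustration.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def nn_pu_risk(labeled_logits, unlabeled_logits, prior):
    """Non-negative PU risk with logistic loss (illustrative sketch).

    labeled_logits:   discriminator logits on the labeled class
                      (assumed here: agent trajectories).
    unlabeled_logits: discriminator logits on the unlabeled mixture
                      (assumed here: imperfect expert demonstrations).
    prior:            assumed fraction of the unlabeled mixture that
                      belongs to the labeled class (hypothetical value).
    """
    labeled_logits = np.asarray(labeled_logits, dtype=float)
    unlabeled_logits = np.asarray(unlabeled_logits, dtype=float)

    # Cost of scoring labeled samples as the other class.
    risk_labeled = prior * np.mean(softplus(-labeled_logits))

    # Estimated cost on the "other" part of the unlabeled mixture:
    # total unlabeled cost minus the share attributed to the labeled class.
    risk_other = (np.mean(softplus(unlabeled_logits))
                  - prior * np.mean(softplus(labeled_logits)))

    # Clamp at zero so the estimate cannot become negative.
    return risk_labeled + max(0.0, risk_other)


# Usage with made-up logits: agent samples scored high, expert samples mixed.
agent_logits = np.array([2.0, 1.5, 3.0])
expert_logits = np.array([1.0, -2.0, -3.0, -1.5])
loss = nn_pu_risk(agent_logits, expert_logits, prior=0.3)
```

Under these assumptions, expert demonstrations that the discriminator scores as agent-like are discounted through the prior-weighted correction term rather than being forced into the opposite class, which is one way to read "investigating unlabeled demonstrations as they are".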