Generative Adversarial Imitation Learning (GAIL) trains a generative policy to mimic a demonstrator. It uses on-policy Reinforcement Learning (RL) to optimize a reward signal derived from a GAN-like discriminator. A major drawback of GAIL is its training instability: it inherits both the complex training dynamics of GANs and the distribution shift introduced by RL. This can cause oscillations during training, harming sample efficiency and final policy performance. Recent work has shown that control theory can help GAN training converge. This paper extends that line of work, conducting a control-theoretic analysis of GAIL and deriving a novel controller that not only pushes GAIL to the desired equilibrium but also achieves asymptotic stability in a 'one-step' setting. Based on this, we propose a practical algorithm, 'Controlled-GAIL' (C-GAIL). On MuJoCo tasks, our controlled variant speeds up convergence, reduces the range of oscillation, and matches the expert's distribution more closely for both vanilla GAIL and GAIL-DAC.