In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, one should 1) train an ensemble of distilled policies, and 2) distil them on as much data from the training environments as possible. We empirically verify that these insights continue to hold in more general settings, where the assumptions required for the theory no longer apply. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
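To make the two practical insights concrete, the sketch below shows one way post-training distillation into an ensemble of student policies could look: each student is trained to match the teacher's action distribution on states gathered from the training environments, and the ensemble acts by averaging the students' distributions. This is only a minimal illustration under assumed interfaces, not the paper's implementation; the names (`StudentPolicy`, `distil_ensemble`, `teacher_logits_fn`, `states`) are hypothetical placeholders.

```python
# Minimal sketch (PyTorch): distil a trained teacher into an ensemble of
# student policies using as many states from the training environments as
# possible, then act by averaging the students' action distributions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentPolicy(nn.Module):
    """Small policy network to be distilled from the teacher (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # action logits


def distil_ensemble(teacher_logits_fn, states, obs_dim, n_actions,
                    n_students=5, epochs=10, lr=1e-3):
    """Distil one teacher into several students by minimising the KL divergence
    between each student's and the teacher's action distributions on `states`."""
    students = [StudentPolicy(obs_dim, n_actions) for _ in range(n_students)]
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits_fn(states), dim=-1)
    for student in students:
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(epochs):
            log_p = F.log_softmax(student(states), dim=-1)
            loss = F.kl_div(log_p, teacher_probs, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return students


def ensemble_act(students, obs):
    """Act with the ensemble by averaging the students' action probabilities."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(s(obs), dim=-1) for s in students]).mean(0)
    return probs.argmax(dim=-1)


# Hypothetical usage:
# states = collect_states_from_training_envs()   # as much data as possible
# students = distil_ensemble(trained_agent, states, obs_dim=8, n_actions=4)
# action = ensemble_act(students, new_observation)
```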