In deep reinforcement learning (RL), data augmentation is widely regarded as a tool for inducing a set of useful priors about semantic consistency, and thereby improving sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it into the RL agent often interferes with RL training and degrades sample efficiency. Meanwhile, the agent tends to forget the prior because of the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method that can inject the consistency prior at any time (even after RL), and a simple yet efficient framework that schedules the distillation automatically. Specifically, the proposed framework first focuses on mastering the training environments, regardless of generalization, by adaptively deciding which augmentation, or none at all, to use during training. After this, we add the distillation to extract the remaining generalization benefits from all the augmentations, which requires no additional samples. In our experiments, we demonstrate the utility of the proposed framework, in particular of its option to postpone augmentation until the end of RL training.
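The idea of injecting a consistency prior after RL, without new samples, can be illustrated with a minimal sketch. The abstract does not specify the loss or the network interfaces, so everything below is an assumption made for illustration: the loss is taken to be a KL divergence between a frozen teacher policy on the original observation and a student policy on the augmented observation, and the names `consistency_distillation_loss`, `teacher`, `student`, and `augment` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def consistency_distillation_loss(
    teacher: Callable[[Vector], Vector],   # hypothetical: frozen policy, obs -> logits
    student: Callable[[Vector], Vector],   # hypothetical: policy being distilled
    observations: Sequence[Vector],        # observations already collected during RL
    augment: Callable[[Vector], Vector],   # hypothetical augmentation function
) -> float:
    """Assumed loss: the student on the augmented observation should match
    the teacher on the original one. Only stored observations are used,
    so no new environment samples are needed."""
    total = 0.0
    for obs in observations:
        target = softmax(teacher(obs))
        pred = softmax(student(augment(obs)))
        total += kl_divergence(target, pred)
    return total / len(observations)


# Toy usage: a one-dimensional observation and a two-action policy.
toy_policy = lambda obs: [obs[0], -obs[0]]
identity = lambda obs: list(obs)
sign_flip = lambda obs: [-x for x in obs]
batch = [[1.0], [2.0]]

loss_consistent = consistency_distillation_loss(toy_policy, toy_policy, batch, identity)
loss_inconsistent = consistency_distillation_loss(toy_policy, toy_policy, batch, sign_flip)
```

In this toy setting the loss is zero when the augmentation leaves the policy output unchanged and positive when the student reacts to the augmentation, which is the signal a distillation step would minimize. Because the teacher is held fixed, such a step could in principle be run at any point in training, including after RL has finished.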