How the level sampling process impacts zero-shot generalisation in deep reinforcement learning

A key obstacle to the wider adoption of autonomous agents trained via deep reinforcement learning (RL) is their limited ability to generalise to new environments, even when these share similar characteristics with environments encountered during training. In this work, we investigate how non-uniform sampling of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents, considering two failure modes: overfitting and over-generalisation. As a first step, we measure the mutual information (MI) between the agent's internal representation and the set of training levels, which we find to be well correlated with instance overfitting. Compared with uniform sampling, adaptive sampling strategies that prioritise levels based on their value loss are more effective at maintaining a lower MI, which provides a novel theoretical justification for this class of techniques. We then turn our attention to unsupervised environment design (UED) methods, which adaptively generate new training levels and minimise MI more effectively than methods that sample from a fixed set. However, we find that UED methods significantly shift the training distribution, resulting in over-generalisation and worse ZSG performance over the distribution of interest. To prevent both instance overfitting and over-generalisation, we introduce self-supervised environment design (SSED). SSED generates levels using a variational autoencoder, effectively reducing MI while minimising the shift from the distribution of interest, and leads to statistically significant improvements in ZSG over fixed-set level sampling strategies and UED methods.
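
As an illustration of what prioritising levels by value loss can look like, the following is a minimal sketch, not the method evaluated in this work. It assumes a rank-based scheme in the style of prioritised level replay, where each level carries a scalar value-loss score and levels with higher scores are sampled more often. The function names, the `temperature` parameter and its default are hypothetical choices made for the example.

```python
import random


def rank_prioritised_probs(value_losses, temperature=0.3):
    """Turn per-level value-loss scores into sampling probabilities.

    Assumption: levels are weighted by the rank of their score, not by the
    raw score. `temperature` (hypothetical name) controls how sharply the
    distribution favours high-loss levels; smaller values are sharper.
    """
    if not value_losses:
        raise ValueError("need at least one level")
    # Rank 1 is assigned to the level with the highest value loss.
    order = sorted(range(len(value_losses)),
                   key=lambda i: value_losses[i], reverse=True)
    weights = [0.0] * len(value_losses)
    for rank, level in enumerate(order, start=1):
        weights[level] = (1.0 / rank) ** (1.0 / temperature)
    total = sum(weights)
    return [w / total for w in weights]


def sample_level(value_losses, temperature=0.3, rng=random):
    """Draw one level index according to the rank-based probabilities."""
    probs = rank_prioritised_probs(value_losses, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `sample_level([0.1, 0.9, 0.4])` returns index 1 most often, since that level has the largest value loss. Uniform sampling corresponds to ignoring the scores entirely, which is the baseline the adaptive strategies are compared against.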
