Training of deep reinforcement learning agents is slowed considerably by input dimensions that do not usefully condition the reward function. Existing modules such as layer normalization can be trained with weight decay to act as a form of selective attention, i.e. an input mask that shrinks the scale of unnecessary inputs and thereby accelerates training of the policy. However, we find the surprising result that adding numerous parameters to the computation of the input mask yields much faster training. A simple, high-dimensional masking module is compared with layer normalization and with a model that uses no input suppression. The high-dimensional mask produced a four-fold speedup in training over the baseline without input suppression and a two-fold speedup over the layer normalization method.
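To make the idea of a learnable, over-parameterized input mask concrete, the following is a minimal sketch, assuming a PyTorch implementation in which each observation dimension is gated by the mean of several redundant learnable parameters that weight decay can drive toward zero for uninformative inputs. The class name HighDimMask and the n_gates parameter are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (assumed, not the paper's implementation) of a
# high-dimensional input-mask module for an RL policy network.
import torch
import torch.nn as nn


class HighDimMask(nn.Module):
    """Elementwise input mask built from many redundant gate parameters."""

    def __init__(self, obs_dim: int, n_gates: int = 16):
        super().__init__()
        # n_gates parameters per input dimension; their mean forms the mask,
        # so weight decay shrinks the gates of unneeded dimensions toward zero.
        self.gates = nn.Parameter(torch.ones(n_gates, obs_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = self.gates.mean(dim=0)  # one scale per input dimension
        return x * mask                # suppress uninformative inputs


# Usage sketch: prepend the mask to a policy MLP and train with weight decay,
# which is what lets the mask act as selective attention over the observation.
obs_dim, act_dim = 32, 4
policy = nn.Sequential(HighDimMask(obs_dim), nn.Linear(obs_dim, 64),
                       nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.AdamW(policy.parameters(), lr=3e-4, weight_decay=1e-2)
```

Under these assumptions, the layer normalization comparison would simply replace HighDimMask with nn.LayerNorm(obs_dim), and the unsuppressed baseline would omit the masking layer entirely.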