To reduce the number of trainable parameters in deep transformer networks, we use reinforcement learning (RL) to dynamically select layers during training and tie them together. Every few iterations, the RL agent decides, for each layer $i$, whether to train it independently or to copy the weights of a previous layer $j<i$. This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique. Experimental evaluations show that our model modestly outperforms the baseline transformer in perplexity while drastically reducing the number of trainable parameters. In particular, memory consumption during training is up to one order of magnitude lower than with conventional training.
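The per-layer tying decision can be illustrated with a small sketch. This is not the authors' implementation: a uniform random policy stands in for the RL agent, the function names (`sample_tying`, `resolve_owners`, `count_trainable`) and the `params_per_layer` value are hypothetical, and following a chain of copies back to the layer that owns the weights is an assumption about how tied layers are resolved.

```python
import random


def sample_tying(num_layers, rng):
    """Pick, for each layer i, either i itself (train independently)
    or an earlier layer j < i whose weights it copies.

    Assumption: a uniform random choice replaces the RL agent's policy.
    """
    choices = []
    for i in range(num_layers):
        # Candidates: the layer itself, or any previous layer j < i.
        candidates = [i] + list(range(i))
        choices.append(rng.choice(candidates))
    return choices


def resolve_owners(choices):
    """Map each layer to the layer that owns its weights.

    Assumption: if layer i copies layer j, and j is itself tied to k,
    then i ends up sharing k's weights.
    """
    owners = []
    for i, j in enumerate(choices):
        owners.append(i if j == i else owners[j])
    return owners


def count_trainable(owners, params_per_layer):
    """Only layers that own their weights contribute trainable parameters."""
    return len(set(owners)) * params_per_layer


if __name__ == "__main__":
    rng = random.Random(0)
    choices = sample_tying(num_layers=6, rng=rng)
    owners = resolve_owners(choices)
    print("choices:", choices)
    print("owners: ", owners)
    print("trainable parameters:", count_trainable(owners, params_per_layer=1000))
```

For example, the decision vector `[0, 0, 2, 1]` resolves to owners `[0, 0, 2, 0]`: layers 1 and 3 share the weights of layer 0, so only two of the four layers hold trainable parameters. In the setting the abstract describes, this decision would be re-sampled every few training iterations.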