We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to reduce computational time. We propose three strategies for assigning parameters to each layer: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies remain effective when a large amount of training data is used, as in the recent WMT competition.
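The abstract names the three strategies without defining them. The sketch below illustrates one plausible reading, assuming N layers draw on M distinct parameter sets with N a multiple of M: Sequence gives consecutive layers the same set, Cycle repeats the sets in order, and Cycle (rev) does the same but reverses the order for the last M layers. The function name `assign_layers`, the strategy strings, and the zero-based indices are hypothetical choices made for illustration, not taken from the paper.

```python
from typing import List


def assign_layers(num_layers: int, num_param_sets: int, strategy: str) -> List[int]:
    """Return, for each layer, the zero-based index of the parameter set it uses.

    Assumption of this sketch: num_layers is a positive multiple of num_param_sets.
    """
    n, m = num_layers, num_param_sets
    if m < 1 or n < m or n % m != 0:
        raise ValueError("num_layers must be a positive multiple of num_param_sets")

    if strategy == "sequence":
        # Each parameter set is used by n // m consecutive layers.
        return [i // (n // m) for i in range(n)]
    if strategy == "cycle":
        # Parameter sets are stacked in order and the stack is repeated.
        return [i % m for i in range(n)]
    if strategy == "cycle_rev":
        # Same as "cycle", except the last m layers use the sets in reverse order.
        return [i % m if i < n - m else m - 1 - (i % m) for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("sequence", "cycle", "cycle_rev"):
        print(name, assign_layers(6, 3, name))
    # sequence  [0, 0, 1, 1, 2, 2]
    # cycle     [0, 1, 2, 0, 1, 2]
    # cycle_rev [0, 1, 2, 2, 1, 0]
```

Under this reading, setting `num_param_sets=1` maps every layer to the same parameter set, which corresponds to the fully shared configuration that the proposed approach relaxes.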