Training memory-based transformers can require a large amount of memory and can be quite inefficient. We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers, which are often used for long-range context problems. For our experiments, we use Transformer-XL, a memory-based transformer model, as our baseline. We show that our resulting model, Skip Cross-head TransformerXL, outperforms the baseline on a character-level language modeling task with a similar number of parameters, and outperforms the baseline on a word-level language modeling task with almost 20% fewer parameters. Our proposed methods do not require any additional memory. We also demonstrate the effectiveness of our regularization mechanism on BERT, where it achieves similar performance while reducing the standard deviation of scores by around 30% on multiple GLUE tasks.