Dropout is an indispensable regularization technique in today's popular pre-trained neural language models. To reduce the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose Layer-wise Regularized Dropout (LR-Drop), a novel method designed specifically for Transformer-based language models. LR-Drop applies the consistency training strategy to each Transformer layer rather than to the output layer alone. Each training sample passes through two siamese sub-models sampled by dropout, and LR-Drop forces the hidden states, multi-head attention matrices, and output distributions of the two sub-models to be consistent. LR-Drop can be regarded as a "self-distillation" framework in which each sub-model generated by dropout serves as both the "teacher" and the "student" of the other. In extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (15 datasets in total), LR-Drop achieves superior performance, including state-of-the-art results.
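The layer-wise consistency idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distance measures or loss weights are used, so this sketch assumes mean squared error for hidden states and attention matrices, a symmetric KL divergence for the output distributions, and hypothetical weights `w_hidden`, `w_attn`, and `w_out`. The function `toy_forward` is a hypothetical stand-in for a Transformer forward pass, used only to produce two dropout-sampled passes over the same input.

```python
import math
import random


def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sym_kl(p, q):
    """Symmetric KL: each pass is treated as both teacher and student."""
    return 0.5 * (kl(p, q) + kl(q, p))


def lr_drop_consistency(pass_a, pass_b, w_hidden=1.0, w_attn=1.0, w_out=1.0):
    """Sum per-layer consistency terms plus an output-distribution term.

    Each pass is a dict with:
      'hidden': list of per-layer hidden vectors,
      'attn':   list of per-layer (flattened) attention weights,
      'probs':  output distribution.
    The distance measures and weights are assumptions of this sketch.
    """
    hidden_term = sum(mse(a, b) for a, b in zip(pass_a["hidden"], pass_b["hidden"]))
    attn_term = sum(mse(a, b) for a, b in zip(pass_a["attn"], pass_b["attn"]))
    out_term = sym_kl(pass_a["probs"], pass_b["probs"])
    return w_hidden * hidden_term + w_attn * attn_term + w_out * out_term


def toy_forward(x, num_layers, p, rng):
    """Hypothetical stand-in for a Transformer: dropout + tanh per 'layer'."""
    hidden, attn = [], []
    h = x
    for _ in range(num_layers):
        h = [math.tanh(v) for v in dropout(h, p, rng)]
        hidden.append(h)
        attn.append(softmax(h))  # stand-in for one row of an attention matrix
    return {"hidden": hidden, "attn": attn, "probs": softmax(h)}


# Two passes of the same sample draw different dropout masks from the
# shared generator, giving two "siamese" sub-models.
rng = random.Random(0)
sample = [0.5, -1.0, 2.0, 0.3]
first = toy_forward(sample, num_layers=2, p=0.5, rng=rng)
second = toy_forward(sample, num_layers=2, p=0.5, rng=rng)
print("consistency loss:", lr_drop_consistency(first, second))
```

In an actual training loop, a term like this would presumably be added to the usual task loss so that gradients push the two sub-models toward agreement at every layer; how the terms are combined and weighted is not specified in the abstract.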