We introduce DropDim, a structured dropout method designed to regularize the self-attention mechanism, a key component of the Transformer. Whereas standard dropout randomly drops individual neurons, DropDim drops part of the embedding dimensions. Because whole dimensions are erased, the semantic information they carry is discarded completely. This breaks excessive co-adaptation between embedding dimensions and forces self-attention to encode meaningful features with a certain number of embedding dimensions erased. Experiments on a wide range of tasks on the MUST-C English-German dataset show that DropDim effectively improves model performance, reduces overfitting, and complements other regularization methods. Combined with label smoothing, it reduces the word error rate (WER) from 19.1% to 15.1% on the automatic speech recognition (ASR) task and raises the BLEU score from 26.90 to 28.38 on the machine translation (MT) task. On the speech translation (ST) task, the model reaches a BLEU score of 22.99, an increase of 1.86 BLEU points over the strong baseline.
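The core idea in the abstract, dropping embedding dimensions rather than individual neurons, can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. The function name `drop_dim`, the drop rate `p`, the independent sampling of each dimension, the sharing of one mask across all sequence positions, and the `1/(1-p)` rescaling (borrowed from standard inverted dropout) are all assumptions that the abstract does not specify.

```python
import numpy as np


def drop_dim(x, p=0.1, rng=None, training=True):
    """Zero out whole embedding dimensions of x.

    x is assumed to have shape (..., seq_len, d_model). A dropped
    dimension is erased at every position, unlike standard dropout,
    which samples a separate decision for every element.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x  # no-op at inference time
    rng = np.random.default_rng() if rng is None else rng
    d_model = x.shape[-1]
    # One keep/drop decision per embedding dimension (assumption:
    # dimensions are sampled independently with drop probability p).
    keep = rng.random(d_model) >= p
    # Broadcasting applies the same mask to every position. The
    # 1/(1-p) rescaling is an assumption taken from inverted dropout.
    return x * keep / (1.0 - p)


# Usage: each column of y is either entirely zero or entirely rescaled.
x = np.ones((4, 8))
y = drop_dim(x, p=0.5, rng=np.random.default_rng(0))
```

Sharing the mask across positions is what makes the dropout structured in this sketch: a dropped dimension is unavailable everywhere in the sequence, so the remaining dimensions cannot rely on it.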