DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

from arxiv, 16 pages, 10 tables, 2 Figures. The DeBERTaV3 model significantly improves performance of the downstream NLU tasks over models with a similar structure, e.g. DeBERTaV3 large achieves 91.37% average GLUE score which is 1.37% over DeBERTa large. XSmall has only 22M backbone parameters, but significantly outperforms RoBERTa/XLNet-base. Paper is published as a conference paper at ICLR 2023

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.

翻译：本文提出了一种新的预训练语言模型DeBERTaV3，通过采用替换令牌检测（RTD）替代掩码语言模型（MLM）作为更高效的预训练任务，改进了原始DeBERTa模型。我们的分析表明，ELECTRA中的原始嵌入共享方法会损害训练效率和模型性能。这是因为判别器和生成器的训练损失将令牌嵌入拉向不同方向，产生了“拉锯战”动态。为此，我们提出了一种新的梯度解耦嵌入共享方法，避免了拉锯战动态，提高了训练效率和预训练模型的质量。我们使用与DeBERTa相同的设置预训练了DeBERTaV3，以证明其在广泛下游自然语言理解（NLU）任务中的卓越性能。以包含八项任务的GLUE基准为例，DeBERTaV3 Large模型取得了91.37%的平均得分，比DeBERTa高1.37%，比ELECTRA高1.91%，在相似结构的模型中创造了新的最先进（SOTA）水平。此外，我们还预训练了多语言模型mDeBERTa，并观察到相比英语模型在强基线上有更大提升。例如，mDeBERTa Base在XNLI上实现了79.8%的零样本跨语言准确率，比XLM-R Base提升3.6%，在该基准上创造了新的SOTA。我们已在https://github.com/microsoft/DeBERTa公开提供了预训练模型和推理代码。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

【Google-斯坦福-ICLR2020】ELECTRA:预训练文本编码器作为鉴别器而不是生成器

专知会员服务

14+阅读 · 2020年3月8日

【Amazon】使用预先训练的Transformer模型进行数据增强，Data Augmentation using Pre-trained Transformer Models

专知会员服务

51+阅读 · 2020年3月7日

如何构建多模态BERT? 这份UNC76页《LXMERT: 从Transformer学习跨模态编码表示》PPT告诉您，附论文代码

专知会员服务

85+阅读 · 2020年2月27日