In this paper, we introduce the oBERTa range of language models, an easy-to-use set of models that allows Natural Language Processing (NLP) practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization, and leverages frozen embeddings, improved distillation, and model initialization to deliver higher accuracy on a broad range of transfer tasks. In generating oBERTa, we explore how the highly optimized RoBERTa differs from BERT when pruned during pre-training and fine-tuning, and we find it less amenable to compression during fine-tuning. We evaluate oBERTa on seven representative NLP tasks and find that the improved compression techniques allow a pruned oBERTa model to match the performance of BERTbase and exceed the performance of Prune OFA Large on the SQuAD v1.1 Question Answering dataset, despite being 8x and 2x faster in inference, respectively. We release our code, training regimes, and associated models to encourage broad usage and experimentation.
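To make the pruning step mentioned in the abstract concrete, the following is a minimal sketch of unstructured magnitude pruning on a toy parameter dictionary. It is not the oBERTa training code: the function names `magnitude_prune` and `prune_model`, the `frozen_prefix` parameter, the toy weights, and the 50% sparsity level are all hypothetical and chosen for illustration. Treating "frozen" embeddings as tensors the pruner leaves untouched is also an assumption; the abstract does not define the term.

```python
# Illustrative sketch only: generic magnitude pruning, not the oBERTa recipe.
from typing import Dict, List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def prune_model(
    params: Dict[str, List[float]],
    sparsity: float,
    frozen_prefix: str = "embeddings",  # hypothetical naming convention
) -> Dict[str, List[float]]:
    """Prune every tensor except those whose name starts with `frozen_prefix`.

    Assumption: "frozen" is read here as "skipped by the pruner".
    """
    return {
        name: list(w) if name.startswith(frozen_prefix)
        else magnitude_prune(w, sparsity)
        for name, w in params.items()
    }


# Toy parameters standing in for a real model's weights.
toy = {
    "embeddings.word": [0.5, -0.1, 0.3, 0.05],
    "encoder.layer0": [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.2, -0.6, 0.08, 0.1],
}
pruned = prune_model(toy, sparsity=0.5)
```

In this toy run, half of the encoder weights (those with the smallest magnitudes) are set to zero while the embedding weights are copied unchanged. A real pipeline would apply the mask gradually during training and pair it with distillation and quantization, which this sketch omits.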