BERT-based models have recently been highly successful on a variety of natural language processing (NLP) tasks, such as reading comprehension, natural language inference, and sentiment analysis. All BERT-based architectures use a self-attention block followed by a block of intermediate layers as their basic building component. However, a strong justification for including these intermediate layers is missing from the literature. In this work, we investigate the importance of intermediate layers for overall network performance on downstream tasks. We show that reducing the number of intermediate layers and modifying the architecture of BERT-BASE results in minimal loss in fine-tuning accuracy on downstream tasks while decreasing the number of parameters and the training time of the model. Additionally, we use centered kernel alignment and probing linear classifiers to gain insight into our architectural modifications and to justify the claim that removing intermediate layers has little impact on fine-tuned accuracy.
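For readers unfamiliar with centered kernel alignment (CKA), the following is a minimal sketch of how a similarity score between two sets of layer activations can be computed. It assumes the linear-kernel variant of CKA, which the abstract does not specify, and the function name `linear_cka` and the random stand-in activations are hypothetical illustrations rather than the authors' code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    Both inputs have shape (n_examples, n_features); the feature
    dimensions may differ, but the examples must be the same and
    in the same order. Assumption: linear kernel.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 measures the alignment between the two representations.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalizing by the self-similarity terms keeps the score in [0, 1].
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for activations of two layers on the same 100 inputs.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(100, 16))
layer_b = rng.normal(size=(100, 32))
similarity = linear_cka(layer_a, layer_b)
```

A score near 1 indicates that the two layers encode very similar representations up to rotation and isotropic scaling, which is the kind of evidence one could use to compare layers of an original and a modified network.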