Dynamic early exiting has been shown to speed up inference in pre-trained language models such as BERT. However, every sample must still pass through all consecutive layers up to its exit point, and more complex samples usually pass through more layers, so redundant computation remains. In this paper, we propose SmartBERT, a novel method for BERT inference that combines dynamic early exiting with layer skipping by adding a skipping gate and an exiting operator to each layer of BERT. SmartBERT can adaptively skip some layers and adaptively choose whether to exit. In addition, we propose cross-layer contrastive learning and incorporate it into our training phases to strengthen the intermediate layers and classifiers, which benefits early exiting. To keep the use of skipping gates consistent between the training and inference phases, we propose a hard weight mechanism for the training phase. We conduct experiments on eight classification datasets of the GLUE benchmark. Experimental results show that SmartBERT achieves a 2-3x reduction in computation with minimal accuracy drops compared with BERT, and that our method outperforms previous methods in both efficiency and accuracy. Moreover, on some complex datasets such as RTE and WNLI, we show that early exiting based on entropy hardly works and that the skipping mechanism is essential for reducing computation.
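The inference-time control flow described in the abstract (a skipping gate in front of each layer, plus an entropy-based exit check after each executed layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `adaptive_forward`, the 0.5 skip threshold, the convention that a gate returns a skip probability, and the fallback to the last classifier when every layer is skipped are all assumptions made for illustration. Layers, gates, and classifiers are plain Python callables standing in for the real Transformer components.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_forward(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    gates: Sequence[Callable[[Vector], float]],
    classifiers: Sequence[Callable[[Vector], Vector]],
    entropy_threshold: float,
    skip_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[List[float], int]:
    """Return (class probabilities, number of layers actually executed).

    Hypothetical sketch: each gate returns a skip probability, and the
    decision is made "hard" by thresholding it. An executed layer is
    followed by an exit check on the entropy of its classifier output.
    """
    executed = 0
    probs: List[float] = []
    for layer, gate, classifier in zip(layers, gates, classifiers):
        if gate(hidden) >= skip_threshold:
            continue  # skip this layer: hidden state passes through unchanged
        hidden = layer(hidden)
        executed += 1
        probs = softmax(classifier(hidden))
        if entropy(probs) < entropy_threshold:
            break  # prediction is confident enough: exit early
    if not probs:
        # Assumed fallback: if every layer was skipped, classify the input
        # representation with the last classifier.
        probs = softmax(classifiers[-1](hidden))
    return probs, executed
```

A lower `entropy_threshold` makes exits rarer and therefore executes more layers, while the gates reduce computation independently of the exit check. In this sketch, that independence is what would still save computation on inputs whose classifier entropy never drops below the threshold.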