Pre-trained models are now the default choice for a wide range of NLP tasks. Despite their state-of-the-art (SoTA) results, there is practical evidence that these models may need a different number of layers for different input sequences, since evaluating all layers can lead to overconfident wrong predictions (a phenomenon known as overthinking). This problem can potentially be addressed with adaptive computation time approaches, which were originally designed to improve inference speed. The recently proposed PonderNet is a promising solution for early exit, since it treats the exit layer's index as a latent variable. However, its original exit criterion, which samples from the trained posterior distribution over the probability of exiting from the $i$-th layer, introduces high variance in exit layer indices and significantly reduces the resulting model's performance. In this paper, we propose improving PonderNet with a novel deterministic Q-exit criterion and a revisited model architecture. We adapt the proposed mechanism to ALBERT and RoBERTa and compare it with recent early-exit methods. We observe that the proposed changes can be considered significant improvements over the original PonderNet architecture and that they outperform PABEE on a wide range of GLUE tasks. In addition, we perform an in-depth ablation study of the proposed architecture to further understand Lambda layers and their performance.
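To make the contrast between a sampled and a deterministic exit concrete, the following minimal sketch may help. It uses the standard PonderNet parameterization, in which each layer $i$ emits a halting probability $\lambda_i$ and the exit distribution is $p_i = \lambda_i \prod_{j<i}(1 - \lambda_j)$. The abstract does not define the Q-exit criterion, so the deterministic rule shown here (exit at the first layer whose cumulative exit probability reaches a threshold `q`) is an assumption made for illustration only. The function names and the example `lambdas` values are hypothetical.

```python
import random
from itertools import accumulate


def exit_distribution(lambdas):
    """Turn per-layer halting probabilities into a distribution over exit layers.

    p_i = lambda_i * prod_{j<i} (1 - lambda_j); the last layer absorbs the
    remaining mass so that the probabilities sum to one.
    """
    probs, remaining = [], 1.0
    for lam in lambdas[:-1]:
        probs.append(remaining * lam)
        remaining *= 1.0 - lam
    probs.append(remaining)  # forced exit at the final layer
    return probs


def sampled_exit(probs, rng):
    """Stochastic criterion: draw the exit layer index from the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def threshold_exit(probs, q=0.5):
    """Deterministic criterion (assumed form, not taken from the abstract):
    exit at the first layer whose cumulative exit probability reaches q."""
    for layer, cumulative in enumerate(accumulate(probs)):
        if cumulative >= q:
            return layer
    return len(probs) - 1  # fall back to the last layer


# Hypothetical halting probabilities for a 6-layer model.
lambdas = [0.1, 0.2, 0.4, 0.5, 0.9, 1.0]
probs = exit_distribution(lambdas)

rng = random.Random(0)
sampled_layers = {sampled_exit(probs, rng) for _ in range(1000)}
deterministic_layers = {threshold_exit(probs, q=0.5) for _ in range(1000)}

print("sampled exit layers:", sorted(sampled_layers))
print("deterministic exit layers:", sorted(deterministic_layers))
```

For a fixed input, the sampled criterion spreads exits across several layers, while the thresholded rule always returns the same layer. This is the variance in exit layer indices that the abstract refers to.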