The ability to dynamically adjust the computational load of neural models during inference is crucial for on-device processing scenarios characterised by limited and time-varying computational resources. Early-exit architectures offer a promising solution, appending additional exit branches to intermediate layers of the encoder. In self-attention models for automatic speech recognition (ASR), early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands. Previous research on early-exit ASR models has relied on pre-trained self-supervised models fine-tuned with an early-exit loss. In this paper, we undertake an experimental comparison between fine-tuning pre-trained backbones and training models from scratch with the early-exit objective. Experiments conducted on public datasets reveal that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also achieve higher task accuracy than single-exit or pre-trained models. Furthermore, we explore an exit selection strategy grounded in posterior probabilities as an alternative to the conventional frame-based entropy approach. Our results provide insights into the training dynamics of early-exit architectures for ASR, particularly the efficacy of different training strategies and exit selection methods.
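To make the two exit-selection criteria concrete, the sketch below illustrates, under stated assumptions, how an entropy-based rule and a posterior-probability-based rule could be applied to the per-frame output distributions of each exit branch. This is a minimal illustration, not the paper's implementation: the function names, the top-1 posterior averaging, and the threshold values are assumptions introduced here for clarity.

```python
import numpy as np

def should_exit_entropy(log_probs, threshold):
    """Conventional criterion: exit when the mean per-frame entropy of the
    exit's posterior distribution falls below a threshold.
    log_probs: (T, V) log-posteriors over the vocabulary for one exit."""
    probs = np.exp(log_probs)
    frame_entropy = -(probs * log_probs).sum(axis=-1)  # (T,)
    return frame_entropy.mean() < threshold

def should_exit_posterior(log_probs, threshold):
    """Alternative criterion (illustrative): exit when the mean top-1
    posterior probability across frames exceeds a threshold."""
    top1 = np.exp(log_probs).max(axis=-1)  # (T,)
    return top1.mean() > threshold

def select_exit(exit_log_probs, criterion, threshold):
    """Evaluate exits from shallow to deep and stop at the first one whose
    confidence criterion is satisfied; otherwise fall through to the last."""
    for i, lp in enumerate(exit_log_probs):
        if criterion(lp, threshold):
            return i
    return len(exit_log_probs) - 1

if __name__ == "__main__":
    # Toy usage: 3 exits, 50 frames, 32 output tokens, random logits
    # normalised to log-posteriors via a log-softmax.
    rng = np.random.default_rng(0)
    logits = [rng.standard_normal((50, 32)) for _ in range(3)]
    log_probs = [x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
                 for x in logits]
    print(select_exit(log_probs, should_exit_posterior, threshold=0.5))
```

In both rules, a lower entropy or a higher top-1 posterior signals that the intermediate exit is already confident enough for decoding to stop, trading a small amount of accuracy for reduced computation.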