Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more. Typically, larger models are found to yield better performance. However, the significant computational effort required by such large transformer systems is a challenge for embedded and real-world applications. Recent work has shown that transformer models for NLP contain significant redundancy and that massive layer pruning is feasible (Sajjad et al., 2023). Here, we investigate layer pruning in audio models, basing the pruning decision on a convexity criterion. Convexity of classification regions has recently been proposed as an indicator of subsequent fine-tuning performance in a range of application domains, including NLP and audio. In empirical investigations, we find a massive reduction in computational effort with no loss of performance, and in certain cases even improvements.
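To make the notion of layer pruning in a speech transformer concrete, the following is a minimal sketch, not the authors' code, of dropping the top encoder layers of a wav2vec2-style model using the HuggingFace `transformers` API; the checkpoint name and the number of layers kept (6 of 12) are illustrative assumptions, and the convexity-based selection of which layers to keep is not reproduced here.

```python
# Sketch: prune the top transformer layers of a self-supervised speech model.
# Assumes torch and transformers are installed; "facebook/wav2vec2-base" and
# keep=6 are arbitrary illustrative choices, not the paper's configuration.
import torch
from transformers import Wav2Vec2Model


def prune_top_layers(model: Wav2Vec2Model, keep: int) -> Wav2Vec2Model:
    """Keep only the bottom `keep` transformer encoder layers."""
    model.encoder.layers = torch.nn.ModuleList(model.encoder.layers[:keep])
    model.config.num_hidden_layers = keep
    return model


model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
pruned = prune_top_layers(model, keep=6)  # hypothetical pruning depth

# The pruned model still yields frame-level representations that can be
# fine-tuned for downstream tasks (keyword spotting, emotion detection, ...).
wave = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz
with torch.no_grad():
    hidden = pruned(wave).last_hidden_state
print(hidden.shape)
```

Because the upper layers are simply removed, the computational cost of a forward pass scales down roughly in proportion to the fraction of layers kept, which is the source of the efficiency gains discussed above.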