Transformer-based automatic speech recognition (ASR) models with deep layers have shown significant performance improvements. However, such models are inefficient to deploy on resource-constrained devices. Layer pruning (LP) is a commonly used compression method that removes redundant layers. Previous LP studies usually identify redundant layers according to a task-specific evaluation metric, which is time-consuming for models with a large number of layers, even when a greedy search is used. To address this problem, we propose CoMFLP, a fast-search LP algorithm based on a correlation measure. The correlation between layers is computed to form a correlation matrix, which identifies the redundancy among layers. The search proceeds in two steps: (1) coarse search: determine the top $K$ candidates by pruning the most redundant layers based on the correlation matrix; (2) fine search: select the best pruning proposal among the $K$ candidates using a task-specific evaluation metric. Experiments on an ASR task show that the pruning proposal determined by CoMFLP outperforms existing LP methods while requiring only constant time complexity. The code is publicly available at https://github.com/louislau1129/CoMFLP.
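The two-step search described in the abstract can be illustrated with a minimal sketch. It is not the released CoMFLP implementation; several details are assumptions made only for illustration: the correlation measure is taken to be the Pearson correlation of flattened layer outputs, pruning is restricted to contiguous blocks of layers, a block's redundancy is scored by the correlation between its input and its output, and the function names `correlation_matrix`, `coarse_search`, and `fine_search` are hypothetical. The `evaluate` callable stands in for a task-specific metric such as word error rate, where lower is better.

```python
import numpy as np


def correlation_matrix(layer_outputs):
    """Correlation between every pair of layer outputs.

    layer_outputs[0] is the input to layer 1 and layer_outputs[i] is the
    output of layer i; all arrays share the same shape.
    ASSUMPTION: Pearson correlation of flattened outputs is used here as a
    stand-in; the abstract does not specify the correlation measure.
    """
    flat = np.stack([np.asarray(h).ravel() for h in layer_outputs])
    return np.corrcoef(flat)  # shape: (num_layers + 1, num_layers + 1)


def coarse_search(corr, num_prune, top_k):
    """Rank pruning proposals using the correlation matrix only.

    ASSUMPTION: each proposal removes a contiguous block of `num_prune`
    layers, scored by the correlation between the block's input and output.
    Returns the top_k proposals as lists of 1-based layer indices.
    """
    num_layers = corr.shape[0] - 1
    scored = []
    for start in range(num_layers - num_prune + 1):
        end = start + num_prune
        # A high correlation suggests layers start+1..end change the
        # representation little, so the block is treated as redundant.
        scored.append((corr[start, end], list(range(start + 1, end + 1))))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [layers for _, layers in scored[:top_k]]


def fine_search(candidates, evaluate):
    """Pick the candidate with the lowest task-specific metric.

    `evaluate` is a user-supplied callable (hypothetical here) that maps a
    list of pruned layer indices to a score where lower is better.
    """
    return min(candidates, key=evaluate)
```

In this sketch the expensive metric is evaluated only $K$ times in `fine_search`, regardless of how many layers the model has, while `coarse_search` relies solely on the precomputed correlation matrix. This is one way to read the constant-cost claim in the abstract, under the assumptions stated above.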