Automatic Speech Recognition (ASR) has advanced remarkably with deep neural networks such as the Transformer and the Conformer. However, these models are typically large and expensive at inference time, which makes them difficult to deploy on resource-limited devices. In this paper, we propose a novel compression strategy that combines structured pruning and knowledge distillation to reduce the size and inference cost of the Conformer model while preserving high recognition performance. Our approach uses a set of binary masks to indicate whether each Conformer module is retained or pruned, and employs L0 regularization to learn the optimal mask values. To further improve pruning performance, we use a layerwise distillation strategy to transfer knowledge from the unpruned model to the pruned model. Our method outperforms all pruning baselines on the widely used LibriSpeech benchmark, achieving a 50% reduction in model size and a 28% reduction in inference cost with minimal performance loss.
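To make the mask-learning idea concrete, the following is a minimal sketch of how per-module masks, an L0 penalty, and a layerwise distillation term could be computed. It is not the implementation described in this paper. It assumes the hard concrete relaxation commonly used for L0 regularization; the constants `LEFT`, `RIGHT`, and `BETA`, all function names, and the mean-squared-error form of the distillation loss are illustrative assumptions.

```python
import math
import random

# Stretch limits and temperature of the hard concrete relaxation.
# These are commonly used defaults, assumed here for illustration.
LEFT, RIGHT, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_mask(log_alpha, rng):
    """Draw one relaxed mask value in [0, 1] for a module (training time)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (LEFT, RIGHT), then clip so exact 0 and 1 are reachable.
    return min(1.0, max(0.0, s * (RIGHT - LEFT) + LEFT))


def deterministic_mask(log_alpha):
    """Mask value without sampling noise (inference time)."""
    s = sigmoid(log_alpha)
    return min(1.0, max(0.0, s * (RIGHT - LEFT) + LEFT))


def expected_l0(log_alphas):
    """Expected number of retained modules: sum of P(mask != 0)."""
    shift = BETA * math.log(-LEFT / RIGHT)
    return sum(sigmoid(a - shift) for a in log_alphas)


def layerwise_distillation_loss(teacher_layers, student_layers):
    """Mean squared error between matching layer outputs, averaged over layers."""
    total = 0.0
    for t, s in zip(teacher_layers, student_layers):
        total += sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
    return total / len(teacher_layers)


if __name__ == "__main__":
    rng = random.Random(0)
    log_alphas = [3.0, 0.0, -3.0]  # hypothetical learned parameters, one per module
    print([round(sample_mask(a, rng), 3) for a in log_alphas])
    print([deterministic_mask(a) for a in log_alphas])
    print(expected_l0(log_alphas))
```

In a training loop, one plausible objective under these assumptions would add the ASR loss, a weighted `expected_l0` term that pushes masks toward zero, and a weighted `layerwise_distillation_loss` between the unpruned and pruned models. Modules whose deterministic mask is exactly 0 can then be removed from the network.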