We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their contribution in terms of accuracy and inference speed. To the best of our knowledge, such in-depth analyses on large-scale recognition systems have not been reported in the literature. In addition, we propose a variant of low-rank approximation suitable for incrementally compressing models and for delivering multiple models with varied target sizes. Among other results, we show that a) data-driven pruning outperforms magnitude-driven pruning in several scenarios; b) incremental pruning achieves higher accuracy than one-shot pruning, especially when targeting smaller sizes; and c) low-rank approximation presents the best trade-off between size reduction and inference speed-up for moderate compression.
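To make the criterion and scheduler terminology concrete, the following is a minimal sketch of magnitude-driven pruning applied incrementally. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names (`magnitude_mask`, `linear_schedule`, `incremental_prune`), the linear sparsity schedule, and the optional `fine_tune` callback are all hypothetical, and weights are represented as a flat Python list rather than model tensors.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def linear_schedule(target_sparsity, steps):
    """Hypothetical scheduler: sparsity grows linearly up to the target."""
    return [target_sparsity * (k + 1) / steps for k in range(steps)]


def incremental_prune(weights, target_sparsity, steps, fine_tune=None):
    """Prune in several stages, optionally fine-tuning between stages.

    `fine_tune` is a placeholder (assumption) for a training routine that
    takes and returns a list of weights. With steps=1 this reduces to
    one-shot pruning.
    """
    w = list(weights)
    for sparsity in linear_schedule(target_sparsity, steps):
        mask = magnitude_mask(w, sparsity)
        w = [wi if mi else 0.0 for wi, mi in zip(w, mask)]
        if fine_tune is not None:
            # Re-apply the mask so that pruned weights stay at zero.
            w = [wi if mi else 0.0 for wi, mi in zip(fine_tune(w), mask)]
    return w


weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6]
pruned = incremental_prune(weights, target_sparsity=0.75, steps=3)
```

Without a `fine_tune` step, incremental and one-shot magnitude pruning select the same weights; any accuracy difference between the two can only arise when the remaining weights are updated between stages. A data-driven criterion would replace `abs(weights[i])` with a score computed from training data.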