Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distil-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32 hours of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
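The low-rank embedding compression mentioned above can be illustrated with a minimal sketch: factor the embedding table into two thin matrices via truncated SVD, so the vocabulary stays intact while the parameter count drops. This is only an assumed illustration with toy shapes; the paper's actual recipe (including the feature-distillation step and the chosen rank) may differ, and `low_rank_factorize` is a hypothetical helper name.

```python
import numpy as np

def low_rank_factorize(embedding, rank):
    """Factor an embedding matrix (vocab_size x d_model) into two thin
    factors A (vocab_size x rank) and B (rank x d_model) via truncated SVD.
    Hypothetical sketch of low-rank embedding compression, not the paper's code."""
    U, S, Vt = np.linalg.svd(embedding, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Toy example: a synthetic embedding table of exact rank 32
rng = np.random.default_rng(0)
E = rng.standard_normal((5000, 32)) @ rng.standard_normal((32, 128))
A, B = low_rank_factorize(E, rank=32)
saved = 1 - (A.size + B.size) / E.size  # fraction of embedding parameters removed
```

Because the toy table is exactly rank 32, the factorization reconstructs it almost perfectly while storing roughly a quarter of the parameters; on a real, full-rank embedding the rank trades compression against reconstruction error, which is where the distillation step would recover quality.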