This paper presents a novel data-free and training-free compression approach for speech foundation models based on channelwise clustering via k-means. A more fine-grained, mixed-sparsity pruning scheme that varies the number of parameter clusters from layer to layer is also explored. Experiments on the LibriSpeech dataset suggest that, at a pruning sparsity of 50% on HuBERT-large, the approach yields consistent WER reductions over magnitude-based pruning of 27.73%/18.61% absolute (34.37%/21.91% relative) on the test-clean and test-other subsets before fine-tuning, and of 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning for only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) over magnitude-based pruning were observed on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
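As a rough illustration of what channelwise k-means clustering of a weight matrix could look like, the following minimal sketch groups the output channels (rows) of a single weight matrix into k clusters and represents each channel by its cluster centroid. It is not the paper's implementation: the choice of rows as channels, the mapping from sparsity to k, and the helper names `kmeans` and `cluster_channels` are all assumptions made for illustration. No data or training is involved; only the weights themselves are used.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means over the rows of `points`; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct rows.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(cents):
        # Squared Euclidean distance from every row to every centroid.
        dists = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, assign(centroids)


def cluster_channels(weight, sparsity):
    """Assumed scheme: keep k = (1 - sparsity) * n_channels centroids per layer."""
    n_channels = weight.shape[0]
    k = max(1, int(round(n_channels * (1.0 - sparsity))))
    return kmeans(weight, k)


# Toy weight matrix standing in for one linear layer (64 output channels).
W = np.random.default_rng(0).standard_normal((64, 32))
centroids, labels = cluster_channels(W, sparsity=0.5)
W_hat = centroids[labels]  # each channel replaced by its cluster centroid
```

Under the same assumptions, a mixed-sparsity variant would call `cluster_channels` with a different `sparsity` value for each layer, so that the number of clusters varies at the layer level.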