This paper presents a novel data-free and training-free compression approach for speech foundation models based on channelwise k-means clustering. A finer-grained, mixed-sparsity variant that varies the number of parameter clusters from layer to layer is also explored. Experiments on the LibriSpeech dataset suggest that, at a pruning sparsity of 50% on HuBERT-large, the approach yields consistent WER reductions over magnitude-based pruning of 27.73%/18.61% absolute (34.37%/21.91% relative) on the test-clean and test-other subsets before fine-tuning, and of 0.19%/0.79% absolute (3.36%/4.62% relative) after only 3 epochs of fine-tuning. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) over magnitude-based pruning were observed on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
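To make the idea of channelwise clustering concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each row of a layer's weight matrix is treated as one channel, that channels are grouped with plain Lloyd's k-means, and that a sparsity of s maps to keeping round((1 - s) * n_channels) centroids; the abstract does not state any of these details. The function names `kmeans` and `cluster_channels` are hypothetical. No data or training is involved: only the weights themselves are used.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the rows of `points` (shape: n x d)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid: (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def cluster_channels(weight, sparsity):
    """Cluster the rows (channels) of `weight` into fewer shared centroids.

    Assumption: sparsity s keeps round((1 - s) * n_channels) clusters.
    """
    n_channels = weight.shape[0]
    k = max(1, round((1.0 - sparsity) * n_channels))
    return kmeans(weight, k)


# Usage: approximate an 8-channel weight matrix with 4 shared channels.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
centroids, labels = cluster_channels(W, sparsity=0.5)
W_approx = centroids[labels]  # every channel replaced by its cluster centroid
```

Under these assumptions, a mixed-sparsity variant would simply call `cluster_channels` with a different `sparsity` value for each layer. The dense distance computation above is only suitable for small matrices; a real model layer would need a more memory-efficient k-means.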