Much research effort is being devoted to compressing the knowledge of self-supervised models, which are powerful yet large and memory-consuming. In this work, we show that the original method of knowledge distillation (and its more recently proposed extension, decoupled knowledge distillation) can be applied to distilling HuBERT. In contrast to methods that distill internal features, this approach leaves more freedom in the network architecture of the compressed model. We therefore propose to distill HuBERT's Transformer layers into an LSTM-based model whose parameter count is even lower than that of DistilHuBERT and which at the same time shows improved performance in automatic speech recognition.
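For readers unfamiliar with the original knowledge distillation objective, the following is a minimal sketch of the classic temperature-scaled formulation, in which the student is trained to match the teacher's softened output distribution. It is not taken from this work: the function names `softmax` and `kd_loss`, the default temperature, and the use of plain Python lists for logits are all illustrative assumptions, and the abstract does not specify which teacher outputs or which temperature are used.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration: the temperature value and the
    choice of logits are assumptions, not details from the abstract.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 1.0, 0.0]
    print(kd_loss(student, teacher))
```

Because this loss depends only on output distributions and not on internal features, the student can in principle use any architecture, which is the freedom the abstract refers to when it replaces Transformer layers with an LSTM.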