Distilling a large speech foundation model (SFM) into an efficient student model has been applied successfully in low-resource environments. Distillation reduces inference latency, but it requires additional training of the student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore accelerating SFM distillation training to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased during training until the target depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To address this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.
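To make the contrast between stacking variants concrete, the following is a toy sketch only, not the paper's implementation. The abstract does not specify the growth operator, so the sketch assumes that "preserving layer position" means each newly added layer is a copy placed directly after the layer it was copied from, whereas conventional stacking appends a copy of the whole block on top. The function names `stack_on_top` and `stack_interleaved` are hypothetical, and layers are represented by plain string labels rather than real network modules.

```python
import copy


def stack_on_top(layers):
    """Conventional stacking (assumed form): append a copy of the
    whole shallow model on top of itself, doubling the depth."""
    return list(layers) + [copy.deepcopy(layer) for layer in layers]


def stack_interleaved(layers):
    """Interleaved growth (assumption, not confirmed by the source):
    insert each copy right after its source layer, so every layer
    keeps its relative depth in the grown model."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))
    return grown


# Toy 3-layer student; labels stand in for trained layers.
shallow = ["A", "B", "C"]
top = stack_on_top(shallow)
inter = stack_interleaved(shallow)

print(top)    # copies of the bottom layer "A" end up mid-network
print(inter)  # copies stay adjacent to their source layer
```

Under this assumption, a layer initialized from the bottom of the shallow model stays near the bottom of the deeper model, which is one plausible reading of why position preservation would matter when each teacher layer carries layer-specific knowledge.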