Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectures. In Automatic Speech Recognition (ASR), previous studies have predominantly forced the student model to strictly mimic the predictive distribution of a massive teacher model. This static dependency carries an inherent trade-off: while the student rapidly acquires basic linguistic representations, it also inherits the teacher's domain-specific blind spots and over-confident hallucinations, which severely degrades its out-of-distribution generalization. To mitigate this issue, we propose Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework. ASKD systematically decays the student's dependency on the teacher's distribution as training progresses, thereby unlocking the student's independent reasoning capacity, and subsequently employs a self-knowledge distillation phase that acts as a structural regularizer. Applying ASKD, we distill the massive Whisper architecture into a compact variant, ASKD-Whisper. In comprehensive evaluations across diverse acoustic domains, ASKD-Whisper not only achieves a 5x speedup in inference latency but also outperforms its teacher model, yielding a 1.07% lower word error rate (WER). These results demonstrate that ASKD effectively prevents teacher-induced overfitting and establishes a new state of the art for generalizable model compression.
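
The abstract does not specify the decay schedule or the loss formulation, so the following is only a minimal sketch of the general idea: a distillation term whose weight shrinks as training progresses, followed by a phase in which the guiding distribution comes from the student itself. The linear schedule, the temperature, the KL-plus-cross-entropy mixture, and all function and parameter names (`teacher_weight`, `distillation_loss`, `decay_steps`, `guide_weight`) are assumptions made for illustration, not the method's actual definition.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(label, q, eps=1e-12):
    """Negative log-likelihood of the ground-truth label under q."""
    return -math.log(q[label] + eps)


def teacher_weight(step, decay_steps):
    """Hypothetical schedule: linear decay from 1.0 to 0.0 over decay_steps."""
    return max(0.0, 1.0 - step / decay_steps)


def distillation_loss(student_logits, guide_logits, label,
                      guide_weight, temperature=2.0):
    """Mix a distillation term with the supervised term.

    guide_logits come from whichever model currently guides the student:
    the teacher in the decay phase, or (by assumption) a frozen snapshot
    of the student in the self-distillation phase.
    """
    kd = kl_divergence(softmax(guide_logits, temperature),
                       softmax(student_logits, temperature))
    ce = cross_entropy(label, softmax(student_logits))
    return guide_weight * kd + (1.0 - guide_weight) * ce
```

Under these assumptions, the first phase would call `distillation_loss(..., guide_weight=teacher_weight(step, decay_steps))` with teacher logits, so the student leans on the teacher early and on the labels later. The second phase would pass logits from a frozen student snapshot with a small fixed `guide_weight`, which is one plausible reading of self-distillation used as a regularizer.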