Knowledge distillation (KD), best known as an effective method for model compression, aims to transfer the knowledge of a larger network (teacher) to a much smaller network (student). Conventional KD methods usually employ a teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that takes both the source inputs and the output labels as the teacher model's input. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with better guidance. One potential risk of the proposed approach is a trivial solution in which the model's output directly copies the target input. Based on the many-to-one mapping property of the CTC algorithm, we present a training strategy that effectively prevents the trivial solution and thus enables the use of both source and target inputs for model training. Extensive experiments are conducted on two sequence learning tasks: speech recognition and scene text recognition. The experimental results show empirically that the proposed model improves the students across these tasks while considerably speeding up the teacher model's training.
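The many-to-one mapping property of CTC mentioned above can be illustrated with a minimal sketch. It shows only the standard CTC collapse rule (merge consecutive repeats, then remove blanks) and is not the training strategy proposed in this work; the function name `ctc_collapse`, the blank symbol `-`, and the example alignments are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol, chosen only for this illustration


def ctc_collapse(alignment, blank=BLANK):
    """Map a frame-level CTC alignment to its label sequence."""
    # Step 1: merge runs of consecutive identical symbols into one symbol.
    merged = [symbol for symbol, _ in groupby(alignment)]
    # Step 2: remove blank symbols from the merged sequence.
    return "".join(symbol for symbol in merged if symbol != blank)


# Several distinct frame-level alignments collapse to one label sequence.
alignments = ["cc-aa-t", "c-a--tt", "-ca-t--", "caaat--"]
labels = {ctc_collapse(a) for a in alignments}

# A blank between repeated symbols keeps them as two separate labels.
single = ctc_collapse("t-oo")
double = ctc_collapse("to-o")
```

Because many alignments correspond to the same target, the frame-level output is not uniquely determined by the label sequence, which is the property the abstract refers to.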