Knowledge distillation (KD) is widely used to transfer knowledge from a large language model (LLM) to a specialized model in low-data regimes through pseudo-label learning. However, the pseudo labels generated by a teacher model are usually noisy and can degrade KD performance. This study examines KD with noisy teachers and finds that, during KD, the student model can already produce predictions that are more accurate than the teacher labels used to train it, which indicates an inherent ability to denoise noisy teacher labels. Motivated by this finding, we propose Peer-Advised KD to improve on vanilla KD from noisy teachers. Experiments show that Peer-Advised KD outperforms the LLM by approximately 5% with 50 human-labeled examples, and is even competitive with standard supervised fine-tuning on 750 human-labeled examples.
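The denoising observation can be illustrated with a toy simulation. The sketch below is not the experimental setup of this study and does not implement Peer-Advised KD. It assumes a hypothetical two-class Gaussian task, a simulated teacher that flips each true label with a fixed probability, and a logistic-regression student trained only on those pseudo labels. The function name `simulate_noisy_teacher_kd` and all parameter values are illustrative assumptions.

```python
import numpy as np


def simulate_noisy_teacher_kd(n=2000, noise_rate=0.3, seed=0):
    """Toy pseudo-label KD: train a student on noisy teacher labels only."""
    rng = np.random.default_rng(seed)

    # Hypothetical task: two Gaussian blobs centred at (-2, -2) and (2, 2).
    y_true = rng.integers(0, 2, size=n)
    centers = np.where(y_true[:, None] == 1, 2.0, -2.0)
    x = centers + rng.normal(size=(n, 2))

    # Simulated noisy teacher: each true label is flipped with
    # probability noise_rate (symmetric label noise, an assumption).
    flip = rng.random(n) < noise_rate
    y_teacher = np.where(flip, 1 - y_true, y_true)

    # Student: logistic regression fitted by gradient descent on the
    # teacher's pseudo labels; it never sees y_true during training.
    w = np.zeros(2)
    b = 0.0
    lr = 0.1
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y_teacher
        w -= lr * (x.T @ grad) / n
        b -= lr * grad.mean()

    y_student = (x @ w + b > 0).astype(int)

    # Compare both label sources against the held-back true labels.
    teacher_acc = float((y_teacher == y_true).mean())
    student_acc = float((y_student == y_true).mean())
    return teacher_acc, student_acc


teacher_acc, student_acc = simulate_noisy_teacher_kd()
print(f"teacher label accuracy: {teacher_acc:.3f}")
print(f"student accuracy:       {student_acc:.3f}")
```

In this toy setting the label flips are independent of the input, so they largely cancel out when the student fits a smooth decision boundary. Noise from a real LLM teacher is typically input-dependent, so the effect may be weaker or behave differently in practice.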