Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model. Traditional methods fall into two groups: response-based and feature-based. Response-based methods are widely used but have a lower performance ceiling because they ignore intermediate signals, while feature-based methods impose constraints on vocabularies, tokenizers, and model architectures. In this paper, we propose a liberal feature-based distillation method (LEAD). LEAD aligns the distributions between the intermediate layers of the teacher model and the student model. It is effective, extensible, and portable, and it imposes no requirements on vocabularies, tokenizers, or model architectures. Extensive experiments show the effectiveness of LEAD on widely used benchmarks, including MS MARCO Passage Ranking, TREC 2019 DL Track, MS MARCO Document Ranking, and TREC 2020 DL Track. Our code is available at https://github.com/microsoft/SimXNS/tree/main/LEAD.
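To illustrate why aligning distributions, rather than raw hidden states, can remove the dependence on model architecture, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the layer-wise distributions are computed, so the sketch assumes a retrieval setting in which each intermediate layer yields one query vector and a set of candidate passage vectors. Dot-product scoring, the KL divergence, one-to-one layer pairing, and uniform averaging over layers are all assumptions, and the names `layer_distribution` and `alignment_loss` are hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def layer_distribution(query_vecs, passage_vecs):
    """Log-distribution over candidate passages for one layer (assumed form).

    query_vecs:   (batch, dim)
    passage_vecs: (batch, n_candidates, dim)
    """
    # Assumption: relevance is the dot product of query and passage vectors.
    scores = np.einsum("bd,bnd->bn", query_vecs, passage_vecs)
    return log_softmax(scores, axis=-1)


def alignment_loss(teacher_layers, student_layers):
    """Mean KL(teacher || student) over paired layers (assumed pairing).

    Each argument is a list of (query_vecs, passage_vecs) tuples, one per
    layer. Teacher and student may use different hidden sizes, because only
    the distributions over the shared candidate set are compared.
    """
    losses = []
    for (tq, tp), (sq, sp) in zip(teacher_layers, student_layers):
        t_logp = layer_distribution(tq, tp)
        s_logp = layer_distribution(sq, sp)
        kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
        losses.append(kl.mean())
    return float(np.mean(losses))


# Usage: a teacher with hidden size 1024 and a student with hidden size 384.
rng = np.random.default_rng(0)
batch, n_cand = 4, 8
teacher = [(rng.normal(size=(batch, 1024)), rng.normal(size=(batch, n_cand, 1024)))
           for _ in range(2)]
student = [(rng.normal(size=(batch, 384)), rng.normal(size=(batch, n_cand, 384)))
           for _ in range(2)]
loss = alignment_loss(teacher, student)
```

Because each layer is reduced to a distribution over the same candidate passages, the teacher and student hidden sizes never need to match in this sketch, and no projection layer is required. Tokenizers and vocabularies do not appear in the loss at all.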