Knowledge distillation is an effective way to transfer knowledge from a strong teacher model to an efficient student model. Ideally, the better the teacher, the better the student. In practice, this expectation does not always hold: a stronger teacher often yields a worse student after distillation because of the non-negligible gap between teacher and student. To bridge this gap, we propose PROD, a PROgressive Distillation method for dense retrieval. PROD consists of teacher progressive distillation and data progressive distillation, which together gradually improve the student. We conduct extensive experiments on five widely used benchmarks (MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document, and Natural Questions), where PROD achieves state-of-the-art performance among distillation methods for dense retrieval. The code and models will be released.