Knowledge distillation is an effective way to transfer knowledge from a strong teacher model to an efficient student model. Ideally, the better the teacher, the better the student. However, this expectation does not always hold: because of the non-negligible gap between teacher and student, a stronger teacher often yields a worse student after distillation. To bridge this gap, we propose PROD, a PROgressive Distillation method for dense retrieval. PROD consists of teacher progressive distillation and data progressive distillation, which together gradually improve the student. We conduct extensive experiments on five widely used benchmarks (MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document, and Natural Questions), where PROD achieves state-of-the-art results among distillation methods for dense retrieval. The code and models will be released.