Pre-trained language models (PLMs) have recently achieved great success in the field of text representation. However, their high computational cost and high-dimensional representations pose significant challenges for practical applications. An effective way to make these models more accessible is to distill large models into smaller representation models. To mitigate the performance degradation that typically follows distillation, we propose a novel knowledge distillation method called IBKD. Motivated by the Information Bottleneck principle, IBKD aims to maximize the mutual information between the final representations of the teacher and student models while simultaneously reducing the mutual information between the student model's representation and the input data. This enables the student model to preserve the important information learned by the teacher while discarding unnecessary information, thus reducing the risk of over-fitting. Empirical studies on two main downstream applications of text representation (Semantic Textual Similarity and Dense Retrieval) demonstrate the effectiveness of the proposed approach.
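The two mutual-information terms described in the abstract can be summarized in a single objective. The following is only a sketch in the style of the standard Information Bottleneck Lagrangian. The notation is an assumption made here for illustration and is not taken from the paper: $X$ is the input text, $Z_t$ and $Z_s$ are the final representations of the teacher and student, $\theta_s$ are the student's parameters, and $\beta > 0$ is a hypothetical trade-off weight.

\[
\max_{\theta_s} \; I(Z_s; Z_t) \;-\; \beta \, I(Z_s; X)
\]

The first term rewards the student for retaining what the teacher's representation encodes. The second term penalizes information about the input that the student keeps beyond that, which corresponds to the compression side of the bottleneck. How each term is estimated in practice is not specified in the abstract and is left open here.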