Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, KD involves training the student to mimic the teacher's output probabilities, while more advanced techniques have explored guiding the student to adopt the teacher's internal representations. Despite its widespread success, the performance of KD in binary classification and few-class problems has been less satisfactory. This is because the information about the teacher model's generalization patterns scales directly with the number of classes. Moreover, several sophisticated distillation methods may not be universally applicable or effective for data types beyond Computer Vision. Consequently, effective distillation techniques remain elusive for a range of key real-world applications, such as sentiment analysis, search query understanding, and advertisement-query relevance assessment. Taking these observations into account, we introduce a novel method for distilling knowledge from the teacher's model representations, which we term Learning Embedding Linear Projections (LELP). Inspired by recent findings about the structure of final-layer representations, LELP works by identifying informative linear subspaces in the teacher's embedding space and splitting them into pseudo-subclasses. The student model is then trained to replicate these pseudo-subclasses. Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems, where most KD methods suffer.
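The core construction described above can be sketched concretely. The following is a minimal, illustrative sketch only: the abstract does not specify how the informative linear subspaces are found or how the splits are made, so the use of per-class PCA directions and median thresholding here is an assumption for illustration, not the paper's exact algorithm.

```python
import numpy as np

def make_pseudo_subclasses(embeddings, labels, num_directions=1):
    """Split each original class into 2**num_directions pseudo-subclasses.

    Illustrative assumption: project each class's teacher embeddings onto
    the top principal directions of that class and threshold each projection
    at its median, so every original class is partitioned into subclasses.
    The student would then be trained to predict these finer-grained labels.
    """
    labels = np.asarray(labels)
    pseudo = np.zeros(len(labels), dtype=int)
    per_class = 2 ** num_directions
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        X = embeddings[idx] - embeddings[idx].mean(axis=0)
        # Top principal directions of this class's embedding cloud
        # (a stand-in for the "informative linear subspaces").
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        proj = X @ vt[:num_directions].T          # (n_c, num_directions)
        bits = (proj > np.median(proj, axis=0)).astype(int)
        sub = bits @ (2 ** np.arange(num_directions))  # binary code -> subclass id
        pseudo[idx] = c * per_class + sub
    return pseudo

# Example: a binary problem expands into 4 pseudo-subclass targets.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))   # stand-in for teacher embeddings
y = rng.integers(0, 2, size=100)
pseudo = make_pseudo_subclasses(emb, y, num_directions=1)
assert set(pseudo[y == 0]) <= {0, 1} and set(pseudo[y == 1]) <= {2, 3}
```

The point of the expansion is that a 2-class problem, which carries little "dark knowledge" in the teacher's output probabilities, becomes a richer multi-class problem whose soft labels encode more of the teacher's embedding structure.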