Knowledge distillation transfers information about representations from a teacher model to a student model by reducing the difference between them. A drawback of this approach is that it restricts the flexibility of the student's representations, which leads to inaccurate learning of the teacher's knowledge. To address this, we investigate the distillation of representation structures of three types: intra-feature, local inter-feature, and global inter-feature structures. To transfer them, we introduce feature structure distillation methods based on Centered Kernel Alignment (CKA), which assigns a consistent value to similar feature structures and reveals more informative relations. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. The methods are empirically analyzed on the nine language understanding tasks of the GLUE benchmark with Bidirectional Encoder Representations from Transformers (BERT), a representative neural language model. The results show that the proposed methods effectively transfer the three types of structures and improve performance compared to state-of-the-art distillation methods. The code for the methods is available at https://github.com/maroo-sky/FSD.
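To illustrate why a CKA-based objective leaves the student more freedom than matching features directly, the following is a minimal sketch of the standard linear form of Centered Kernel Alignment. It is not taken from the released code: the function name `linear_cka`, the array shapes, and the toy "teacher" and "student" matrices are assumptions made for illustration, and the paper's actual losses for the three structure types may differ.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between x of shape (n, d1) and y of shape (n, d2).

    Rows of x and y must describe the same n examples; the feature
    dimensions d1 and d2 may differ (e.g. a wide teacher, a narrow student).
    """
    # Center every feature column so the comparison ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Toy data (hypothetical): the "student" is the "teacher" rotated by a
# random orthogonal matrix and rescaled, so the two share one structure.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 32))
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
student = 3.0 * (teacher @ q)

cka_same_structure = linear_cka(teacher, student)
mse_same_structure = np.mean((teacher - student) ** 2)
cka_unrelated = linear_cka(teacher, rng.normal(size=(64, 16)))

print("CKA, rotated and rescaled copy:", cka_same_structure)
print("MSE, rotated and rescaled copy:", mse_same_structure)
print("CKA, unrelated features:", cka_unrelated)
```

In this sketch the rotated and rescaled copy scores a CKA of approximately 1 even though its element-wise distance to the teacher is large, which is the sense in which a structure-level measure can assign a consistent value to similar feature structures.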