In this paper, we propose a method for transferring feature representations from larger teacher models to lightweight student models. We mathematically define a new notion called \textit{perception coherence}. Based on this notion, we propose a loss function that accounts for the dissimilarities between data points in feature space through their ranking. At a high level, by minimizing this loss function, the student model learns to mimic how the teacher model \textit{perceives} inputs. More precisely, our method is motivated by the fact that the representational capacity of the student model is weaker than that of the teacher model. Hence, we aim to develop a method that allows for a better relaxation: the student model does not need to preserve the absolute geometry of the teacher's feature space, while still preserving global coherence through dissimilarity ranking. Importantly, while rankings are defined only on finite sets, our notion of \textit{perception coherence} extends them to a probabilistic form. This formulation depends on the input distribution and applies to general dissimilarity metrics. Our theoretical insights provide a probabilistic perspective on the process of feature representation transfer. Our experimental results show that our method outperforms or performs on par with strong baseline methods for representation transfer.
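As a toy illustration of the relaxation described above, the following sketch checks whether a student embedding preserves the teacher's dissimilarity ranking even when the absolute geometry differs. It is not the paper's loss function, nor its probabilistic definition of perception coherence: the function names, the choice of Euclidean distance, and the finite-sample agreement score are all assumptions made for this example.

```python
import math
from itertools import combinations


def pairwise_dissimilarities(features):
    """Euclidean distance for every unordered pair of points (i, j), i < j.

    Assumption: Euclidean distance stands in for a general dissimilarity.
    """
    return {
        (i, j): math.dist(features[i], features[j])
        for i, j in combinations(range(len(features)), 2)
    }


def ranking_agreement(teacher_feats, student_feats):
    """Fraction of pair-vs-pair comparisons ordered identically in both spaces.

    Hypothetical finite-sample score: 1.0 means the student reproduces the
    teacher's full dissimilarity ranking. Ties count as disagreement.
    """
    d_teacher = pairwise_dissimilarities(teacher_feats)
    d_student = pairwise_dissimilarities(student_feats)
    agree = 0
    total = 0
    for a, b in combinations(list(d_teacher), 2):
        total += 1
        # Same sign of the difference means the same ordering of the two pairs.
        if (d_teacher[a] - d_teacher[b]) * (d_student[a] - d_student[b]) > 0:
            agree += 1
    return agree / total


# Teacher features live in 3D; the student uses a rescaled 1D embedding.
teacher = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
student = [(0.0,), (0.5,), (1.5,), (3.5,)]    # distances halved, order kept
shuffled = [(3.5,), (0.0,), (1.5,), (0.5,)]   # same values, wrong assignment

print(ranking_agreement(teacher, student))    # ranking fully preserved
print(ranking_agreement(teacher, shuffled))   # ranking partly broken
```

The rescaled student changes every pairwise distance, so it does not match the teacher's absolute geometry, yet it keeps the ordering of all dissimilarities. A ranking-based objective would treat it as a faithful student, while the shuffled embedding would be penalized.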