Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous-architecture distillation, such as from a Vision Transformer (ViT) to a ResNet18, is challenging because the two architectures produce different spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to bridge this disparity. Although heterogeneous KD approaches have been developed recently to address these issues, they often incur high computational costs and complex designs, or rely too heavily on logit alignment, which limits their ability to exploit complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student's intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher's penultimate-layer features, and then maps the result through the Complementary Feature Mapper (CFM) module, a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD), which partitions the shared logits into n sub-logits that are fused with the teacher's logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances the robustness and generalization of student models. Extensive experiments on CIFAR-100, fine-grained datasets (e.g., CUB200), and ImageNet-1K demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.
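The pipeline described above (projected student features concatenated with teacher penultimate features, mapped to shared logits, then partitioned into sub-logits regularized by an orthogonality penalty) can be sketched as follows. This is a minimal NumPy illustration under assumed feature dimensions and a random stand-in for the CFM's fully connected layer; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_classes, n_sub = 4, 100, 4  # hypothetical sizes

# Stand-ins for real network activations (shapes are assumptions):
student_feat = rng.standard_normal((batch, 512))  # student intermediate feature
                                                  # after conv projector + adaptive pooling
teacher_feat = rng.standard_normal((batch, 768))  # teacher penultimate-layer feature

# CFM: concatenate the complementary features and map them to shared
# logits via a fully connected layer (here a random weight matrix).
W = rng.standard_normal((512 + 768, num_classes)) * 0.01
shared_logits = np.concatenate([student_feat, teacher_feat], axis=1) @ W

# SDD: partition the shared logits into n sub-logits along the class axis.
sub_logits = np.split(shared_logits, n_sub, axis=1)  # n arrays of (batch, C/n)

# OL: penalize pairwise cosine similarity between flattened sub-logits so
# that each sub-logit carries non-redundant knowledge.
def orthogonality_loss(subs):
    vecs = [s.reshape(-1) / np.linalg.norm(s) for s in subs]
    loss = 0.0
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            loss += float(np.dot(vecs[i], vecs[j]) ** 2)
    return loss

ol = orthogonality_loss(sub_logits)
```

In training, the sub-logits would additionally be fused with the teacher's logits for the distillation objective; the sketch only shows the feature fusion, partitioning, and orthogonality terms.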