While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided at https://github.com/thy960112/Distillation-Dynamics.
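The frequency spectrum analysis mentioned above can be illustrated with a minimal sketch: reshape a layer's patch-token features into their 2D grid, take a per-channel 2D FFT, and bin the spectral energy by radial frequency. The function name, bin count, and the random stand-in features below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def spectral_energy_profile(features, n_bins=8):
    """Radially-binned power spectrum of a patch-token feature map.

    features: (H, W, C) array of ViT patch embeddings laid out on their
    spatial grid. Returns an (n_bins,) array giving the fraction of total
    spectral energy in each radial frequency band, summed over channels.
    """
    H, W, C = features.shape
    # Per-channel 2D FFT over the spatial axes, then power spectrum.
    spec = np.abs(np.fft.fftshift(np.fft.fft2(features, axes=(0, 1)),
                                  axes=(0, 1))) ** 2
    # Radial frequency coordinate for every spatial-frequency bin.
    fy = np.fft.fftshift(np.fft.fftfreq(H))
    fx = np.fft.fftshift(np.fft.fftfreq(W))
    r = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    # Accumulate energy into concentric frequency bands.
    bins = np.linspace(0.0, r.max() + 1e-9, n_bins + 1)
    profile = np.empty(n_bins)
    for i in range(n_bins):
        mask = (r >= bins[i]) & (r < bins[i + 1])
        profile[i] = spec[mask].sum()
    return profile / profile.sum()

# Example: random features on a 14x14 patch grid (e.g. a 224px ViT).
rng = np.random.default_rng(0)
feats = rng.standard_normal((14, 14, 64))
profile = spectral_energy_profile(feats)
```

Comparing such profiles between teacher and student layers is one way to quantify whether a high-capacity model concentrates more energy in high-frequency bands than a narrower student can match.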

