While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed ``distillation dynamics'', combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation as a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to their limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided in the supplementary material.

