Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging because the two modalities differ fundamentally in image formation physics. We investigate the spectral structure of the RGB--IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band's cross-modal consistency. The method applies a strict mean squared error (MSE) loss to the low-frequency band to preserve shared structural information, and a relaxed log-MSE loss (weighted at 0.1) to the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4 on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, a 2.4-point improvement over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: https://anonymous.4open.science/r/freq_decoupled_kd-5E5A
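The asymmetric supervision described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. The abstract specifies only a strict MSE on the low-frequency band and a log-MSE weighted at 0.1 on the high-frequency band. Everything else here is an assumption made for illustration: the FFT-based radial low-pass split, the `cutoff` value, the reading of "log-MSE" as log(1 + MSE), and the names `split_bands` and `freq_decoupled_loss`.

```python
import numpy as np


def split_bands(feat, cutoff=0.25):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    `cutoff` is a hypothetical radius in normalized frequency
    (cycles per sample, Nyquist = 0.5); the abstract does not give one.
    """
    _, h, w = feat.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, shape (H, 1)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies, shape (1, W)
    mask = (np.sqrt(fy ** 2 + fx ** 2) <= cutoff).astype(feat.dtype)

    spec = np.fft.fft2(feat, axes=(-2, -1))
    low = np.fft.ifft2(spec * mask, axes=(-2, -1)).real
    high = feat - low                 # residual holds everything above cutoff
    return low, high


def freq_decoupled_loss(student, teacher, cutoff=0.25, high_weight=0.1):
    """Strict MSE on the low band plus a down-weighted log-MSE on the high band.

    Assumption: "log-MSE" is taken as log(1 + MSE), which is zero when the
    bands match and grows slowly for large texture mismatches.
    """
    s_low, s_high = split_bands(student, cutoff)
    t_low, t_high = split_bands(teacher, cutoff)

    low_term = np.mean((s_low - t_low) ** 2)
    high_term = np.log1p(np.mean((s_high - t_high) ** 2))
    return low_term + high_weight * high_term
```

Under these assumptions, a constant offset of 1 between student and teacher (a purely low-frequency error) is penalized at full strength, while a unit-amplitude checkerboard pattern (a purely high-frequency error of the same energy) contributes only 0.1 · log 2, which mirrors the intent of tolerating texture differences while enforcing shared structure.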