DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
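
The decomposition described above can be illustrated with a minimal sketch. The abstract does not specify the form of the loss or of the annealing schedule, so everything here is an assumption: the squared-error terms stand in for whatever contrastive objective DARK actually uses, the linear schedule from $+1$ to $-1$ is one possible annealing, and the names `off_diagonal_weight` and `dark_style_loss` are hypothetical.

```python
def off_diagonal_weight(step, total_steps, start=1.0, end=-1.0):
    """Anneal the off-diagonal weight linearly from `start` to `end`.

    Assumption: a linear schedule; the source only says the weight moves
    from positive (imitation) to negative (repulsion).
    """
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def dark_style_loss(student_sim, teacher_sim, weight):
    """Split a similarity-matching loss into diagonal and off-diagonal parts.

    `student_sim` and `teacher_sim` are square N x N image-text similarity
    matrices (lists of lists). Entry [i][i] is a matched pair; entry [i][j]
    with i != j is a non-target similarity.

    Assumption: squared error is a stand-in for the real distillation loss.
    """
    n = len(student_sim)

    # Diagonal term: always positively weighted, anchoring matched pairs.
    diag = sum(
        (student_sim[i][i] - teacher_sim[i][i]) ** 2 for i in range(n)
    ) / n

    # Off-diagonal term: agreement with the teacher's non-target structure.
    off_pairs = n * (n - 1)
    off = sum(
        (student_sim[i][j] - teacher_sim[i][j]) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    )
    off = off / off_pairs if off_pairs else 0.0

    # weight > 0 rewards imitation; weight < 0 rewards divergence.
    total = diag + weight * off
    return total, diag, off


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]
    student = [[1.0, 0.5], [0.5, 1.0]]
    for step in (0, 5, 10):
        w = off_diagonal_weight(step, total_steps=11)
        total, diag, off = dark_style_loss(student, teacher, w)
        print(f"step={step} weight={w:+.1f} total={total:+.3f}")
```

In this sketch the diagonal term keeps a fixed positive weight for the whole run, while the sign of the off-diagonal contribution flips partway through the schedule. A negatively weighted squared error is unbounded below, so a practical objective would presumably bound or normalize the repulsive term; how DARK handles this is not stated in the abstract.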
