Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
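To make the diagonal/off-diagonal decomposition concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact loss, so everything here is an assumption for illustration: the squared-error form of each term, the linear annealing schedule, the end-point weights `w_start` and `w_end`, and the function names `offdiag_weight` and `dark_style_loss`. It only shows how an anchored diagonal term can be combined with an off-diagonal term whose weight moves from positive (imitate) to negative (repel).

```python
import numpy as np


def offdiag_weight(step, total_steps, w_start=1.0, w_end=-0.5):
    """Hypothetical linear anneal of the off-diagonal weight.

    Starts positive (imitate the teacher) and ends negative (repel).
    The schedule shape and end points are assumptions, not from the paper.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_start + frac * (w_end - w_start)


def dark_style_loss(student_sim, teacher_sim, step, total_steps):
    """Assumed squared-error distillation loss split by a diagonal mask.

    student_sim, teacher_sim: (N, N) image-text similarity matrices for a
    batch of N matched pairs (N >= 2), with matched pairs on the diagonal.
    Returns (total, diagonal term, off-diagonal term, off-diagonal weight).
    """
    student_sim = np.asarray(student_sim, dtype=float)
    teacher_sim = np.asarray(teacher_sim, dtype=float)
    n = student_sim.shape[0]
    diag_mask = np.eye(n, dtype=bool)

    sq_err = (student_sim - teacher_sim) ** 2
    diag_term = sq_err[diag_mask].mean()    # matched pairs: weight fixed at 1
    off_term = sq_err[~diag_mask].mean()    # non-target similarities

    w = offdiag_weight(step, total_steps)
    total = diag_term + w * off_term
    return total, diag_term, off_term, w


if __name__ == "__main__":
    teacher = np.eye(2)
    student = np.array([[0.5, 0.2], [0.4, 1.0]])
    for step in (0, 50, 100):
        total, d, o, w = dark_style_loss(student, teacher, step, 100)
        print(f"step={step:3d}  w={w:+.2f}  diag={d:.3f}  off={o:.3f}  total={total:.3f}")
```

Once the weight turns negative, minimizing this toy objective rewards off-diagonal disagreement with the teacher while the diagonal term still penalizes drift on matched pairs. A squared-error repulsion term is unbounded, so a practical version would presumably need a bounded divergence or a small negative weight; that choice is likewise an assumption here.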