Model stitching, connecting early layers of one model (source) to later layers of another (target) via a lightweight stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model with only a small inference overhead (that of the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
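To make finding (2) concrete, the following is a minimal toy sketch of training a stitch layer with a feature-matching loss at the target's penultimate layer. It is not the authors' implementation: the "models" are tiny random frozen maps rather than real VFMs, the stitch layer is assumed to be a single linear map, the loss is assumed to be a plain squared error, and every name (`source_early`, `target_early`, `target_late`, `sgd_step`, the dimensions) is hypothetical.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Hypothetical sizes: input, source stitch-point width, target stitch-point
# width, and target penultimate width.
D_IN, D_SRC, D_TGT, D_PEN = 6, 5, 4, 3

# Frozen toy stand-ins for the two models (assumptions, not real VFMs).
A_src = rand_matrix(D_SRC, D_IN)   # source early layers
A_tgt = rand_matrix(D_TGT, D_IN)   # target early layers
G_tgt = rand_matrix(D_PEN, D_TGT)  # target later layers, up to the penultimate features

def source_early(x):
    return [math.tanh(v) for v in matvec(A_src, x)]

def target_early(x):
    return [math.tanh(v) for v in matvec(A_tgt, x)]

def target_late(h):
    return [math.tanh(v) for v in matvec(G_tgt, h)]

def penultimate_matching_loss(S, x):
    """Squared error between stitched and original target penultimate features."""
    z_hat = target_late(matvec(S, source_early(x)))  # source -> stitch -> target late
    z_ref = target_late(target_early(x))             # unmodified target model
    return sum((a - b) ** 2 for a, b in zip(z_hat, z_ref))

def sgd_step(S, x, lr):
    """One manual gradient step on the stitch matrix S; all other weights stay frozen."""
    h_src = source_early(x)
    z_hat = target_late(matvec(S, h_src))
    z_ref = target_late(target_early(x))
    # Gradient w.r.t. the pre-activation of the penultimate layer (tanh derivative).
    delta = [2 * (a - b) * (1 - a * a) for a, b in zip(z_hat, z_ref)]
    # Backpropagate through the frozen target later layers to the stitch output.
    grad_u = [sum(G_tgt[k][i] * delta[k] for k in range(D_PEN)) for i in range(D_TGT)]
    # Gradient w.r.t. S is the outer product of grad_u and the source features.
    for i in range(D_TGT):
        for j in range(D_SRC):
            S[i][j] -= lr * grad_u[i] * h_src[j]

def mean_loss(S, data):
    return sum(penultimate_matching_loss(S, x) for x in data) / len(data)

data = [[random.uniform(-1, 1) for _ in range(D_IN)] for _ in range(64)]
S = rand_matrix(D_TGT, D_SRC)  # the only trainable parameters

loss_before = mean_loss(S, data)
for _ in range(200):
    for x in data:
        sgd_step(S, x, lr=0.05)
loss_after = mean_loss(S, data)

print(f"penultimate-matching loss: {loss_before:.4f} -> {loss_after:.4f}")
```

The point of the sketch is where the loss is attached: the stitched features pass through the frozen later layers of the target before being compared, rather than being matched directly at the stitch point. In a realistic setting the stand-ins would be replaced by pretrained model halves and an autodiff framework would compute the gradients.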