Pre-trained vision encoders such as DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across modalities: the embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: an alignment objective that maximizes feature similarity between different modalities of the same scene, and a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous", producing a consistent, powerful embedding for a given scene regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
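The dual objective described above can be sketched as a single loss function. This is a minimal illustration, not the paper's actual implementation: the encoder interfaces, the use of cosine distance for both terms, and the `distill_weight` hyperparameter are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the dual training objective: a cross-modal
# alignment term plus a distillation term against a frozen teacher
# (e.g. DINOv2). Function and parameter names are illustrative.
import torch
import torch.nn.functional as F

def dual_objective_loss(student, teacher, rgb, depth, distill_weight=1.0):
    """Combine cross-modal alignment with frozen-teacher distillation."""
    # Student embeddings for two modalities of the same scene, shape (B, D).
    z_rgb = student(rgb)
    z_depth = student(depth)
    # Teacher is fully frozen: no gradients flow into it.
    with torch.no_grad():
        z_teacher = teacher(rgb)
    # Alignment: maximize cosine similarity between modalities
    # (minimizing 1 - cos does exactly that).
    align = 1.0 - F.cosine_similarity(z_rgb, z_depth, dim=-1).mean()
    # Distillation: anchor the student's features to the teacher's output
    # so the discriminative semantics of the foundation model are retained.
    distill = 1.0 - F.cosine_similarity(z_rgb, z_teacher, dim=-1).mean()
    return align + distill_weight * distill
```

In this sketch both terms use cosine distance, matching the abstract's cosine-similarity framing; a real system might instead use an InfoNCE-style contrastive loss for alignment or an L2 feature-matching loss for distillation.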