Vision Foundation Models (VFMs) have achieved remarkable success across a wide range of downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, thereby "distilling" geometrically grounded knowledge. By replacing the slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts and creates a dynamic learning process in which the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior work, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of the 2D features. The project page is available at https://davidshavin4.github.io/Splat-and-Distill/
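The lift-splat-distill loop described above can be sketched schematically. This is a minimal toy illustration, not the paper's implementation: `teacher_features`, `lift_to_gaussians`, and `splat_to_view` are hypothetical stand-ins (the actual method uses a frozen 2D VFM teacher, a feed-forward network predicting per-pixel 3D Gaussian parameters, and a differentiable Gaussian splatting renderer).

```python
import numpy as np

def teacher_features(image):
    """Stand-in for a frozen 2D VFM: image (H, W, 3) -> features (H, W, C)."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, 8))
    return image @ proj  # toy linear "backbone"

def lift_to_gaussians(feats, depth):
    """Toy feed-forward lift: attach each pixel's feature to a 3D point.
    Stand-in for predicting explicit 3D Gaussian parameters per pixel."""
    H, W, C = feats.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pts = np.stack([xs, ys, depth], axis=-1).reshape(-1, 3).astype(float)
    return pts, feats.reshape(-1, C)

def splat_to_view(pts, feats, shift, H, W):
    """Toy 'splatting': reproject the points under an integer x-shift
    (a stand-in for a novel camera) and scatter features into a 2D map."""
    out = np.zeros((H, W, feats.shape[1]))
    xs = (pts[:, 0] + shift).astype(int)
    ys = pts[:, 1].astype(int)
    valid = (xs >= 0) & (xs < W)
    out[ys[valid], xs[valid]] = feats[valid]
    return out

# One distillation step: splatted novel-view features become the
# supervision target for the student's prediction on that view.
H, W = 4, 6
image = np.full((H, W, 3), 0.5)
depth = np.ones((H, W))

f_teacher = teacher_features(image)
pts, feats = lift_to_gaussians(f_teacher, depth)
f_novel = splat_to_view(pts, feats, shift=1, H=H, W=W)

f_student = np.zeros_like(f_novel)          # untrained student's guess
loss = np.mean((f_student - f_novel) ** 2)  # distillation loss to minimize
```

In the real pipeline the lifting network runs feed-forward per scene (no per-scene optimization), so the teacher's multi-view-consistent targets are produced on the fly during student training.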