We address the problem of extending the capabilities of vision foundation models, such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into 3D Gaussian Splatting scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion enriches the features of a given model, such as CLIP, by leveraging 3D geometry and pairwise similarities induced by another strong model, such as DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using generic DINOv2 features, even though DINOv2, unlike SAM, was not trained on millions of annotated segmentation masks. When applied to CLIP features, our method demonstrates strong performance on open-vocabulary object detection tasks, highlighting the versatility of our approach.
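To make the graph-diffusion step concrete, below is a minimal PyTorch sketch of one plausible instantiation: build a k-nearest-neighbor graph over the 3D Gaussian centers, weight each edge by the cosine similarity of the corresponding DINOv2 features, and iteratively propagate CLIP features along this graph with a restart term. The function name `diffuse_features`, the hyperparameters (`k`, `alpha`, `n_iters`), and all tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def diffuse_features(centers, dino_feats, clip_feats, k=16, alpha=0.5, n_iters=10):
    """Hypothetical sketch: enrich per-Gaussian CLIP features by diffusing
    them over a graph whose edges combine 3D proximity (k-NN on Gaussian
    centers) with DINOv2 pairwise similarity.

    centers:    (n, 3)  3D positions of the Gaussians
    dino_feats: (n, d1) per-Gaussian DINOv2 features (drive edge weights)
    clip_feats: (n, d2) per-Gaussian CLIP features (get diffused)
    """
    # k nearest neighbors in 3D; drop column 0, which is the point itself
    dists = torch.cdist(centers, centers)                 # (n, n)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # (n, k)

    # Edge weights: cosine similarity of DINOv2 features, clamped to >= 0,
    # then row-normalized so each node's neighbor weights sum to 1.
    f = F.normalize(dino_feats, dim=1)
    w = (f.unsqueeze(1) * f[knn]).sum(-1).clamp(min=0)     # (n, k)
    w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-8)

    # Iterative diffusion with a restart term: each step blends the original
    # CLIP feature with the similarity-weighted average of its neighbors.
    x = clip_feats.clone()
    for _ in range(n_iters):
        neigh = (w.unsqueeze(-1) * x[knn]).sum(dim=1)      # (n, d2)
        x = alpha * clip_feats + (1 - alpha) * neigh
    return x
```

The convex combination weighted by `alpha` acts as a restart term, so the diffusion smooths CLIP features along regions that DINOv2 deems similar in appearance and that lie close in 3D, without washing out the original semantics; the specific weighting scheme here is an assumption for illustration.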