Current visual foundation models are trained purely on unstructured 2D data, which limits their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance on semantic segmentation and depth estimation through simple linear probing. Notably, although fine-tuned on a single indoor dataset, the improvement transfers to a variety of indoor datasets as well as out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
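To make the evaluation protocol concrete, the following is a minimal sketch of linear probing: a single linear layer is fit on top of frozen backbone features, so any gain reflects feature quality rather than task-specific capacity. All names, shapes, and the synthetic data are illustrative assumptions, not the paper's actual pipeline; a closed-form ridge regression stands in for the probe on a depth-style regression target.

```python
import numpy as np

# Hypothetical stand-ins: "features" plays the role of frozen per-pixel
# features from a 2D foundation model; "depth" is a synthetic regression
# target. Shapes and names are illustrative assumptions.
rng = np.random.default_rng(0)
n_pixels, feat_dim = 1000, 64
features = rng.normal(size=(n_pixels, feat_dim))            # frozen features
true_w = rng.normal(size=feat_dim)
depth = features @ true_w + 0.01 * rng.normal(size=n_pixels)  # noisy targets

# Linear probe: only a linear map on top of the frozen features is fit
# (closed-form ridge regression); the backbone itself is never updated.
lam = 1e-3
w = np.linalg.solve(
    features.T @ features + lam * np.eye(feat_dim),
    features.T @ depth,
)
pred = features @ w
mse = float(np.mean((pred - depth) ** 2))
```

Because the probe is linear and shared across all evaluated backbones, a lower probe error directly indicates more linearly decodable (here, depth-relevant) information in the features themselves.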