In contrast to the many NLP and 2D vision foundation models, learning a 3D foundation model poses considerably greater challenges, primarily due to the inherent variability of 3D data and the diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations, thereby establishing a pathway to 3D foundation models. Observing that informative 3D features should encode rich geometry and appearance cues that can be used to render realistic images, we propose to learn 3D representations via differentiable neural rendering. Specifically, we train a 3D backbone with a devised volumetric neural renderer by comparing the rendered images against the real ones. Notably, our approach allows the learned 3D encoder to be seamlessly integrated into diverse downstream tasks. These tasks encompass not only high-level challenges such as 3D detection and segmentation but also low-level objectives like 3D reconstruction and image synthesis, spanning both indoor and outdoor scenarios. We further demonstrate that the proposed methodology can also pre-train a 2D backbone, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness. Code and models are available at https://github.com/OpenGVLab/PonderV2.
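The pre-training objective described above (rendering images from learned 3D features and comparing them with real captures) reduces to differentiable volume rendering followed by a photometric loss. Below is a minimal NumPy sketch of that compositing step and loss. The function names, tensor shapes, and the MSE loss are illustrative assumptions, not the paper's actual implementation, which would run inside an autograd framework so that gradients flow back into the 3D backbone.

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Alpha-composite per-ray samples into pixel colors (standard volume rendering).

    densities: (R, S) non-negative density sigma at S samples along each of R rays
    colors:    (R, S, 3) RGB predicted at each sample (e.g. by a small MLP head)
    deltas:    (R, S) distances between consecutive samples along each ray
    """
    # Per-sample opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): light surviving to sample i.
    shifted = np.concatenate(
        [np.ones((alphas.shape[0], 1)), 1.0 - alphas[:, :-1]], axis=1)
    transmittance = np.cumprod(shifted, axis=1)
    weights = alphas * transmittance                      # (R, S), each row sums to <= 1
    rendered = (weights[..., None] * colors).sum(axis=1)  # (R, 3) composited pixel colors
    return rendered, weights

def photometric_loss(rendered, target):
    """Mean-squared error between rendered and real pixel colors."""
    return float(np.mean((rendered - target) ** 2))

# Illustrative usage with random stand-ins for network predictions.
rng = np.random.default_rng(0)
R, S = 4, 16
densities = rng.uniform(0.0, 2.0, (R, S))
colors = rng.uniform(0.0, 1.0, (R, S, 3))
deltas = np.full((R, S), 0.1)
rendered, weights = volume_render(densities, colors, deltas)
loss = photometric_loss(rendered, rng.uniform(0.0, 1.0, (R, 3)))
```

In the pre-training setting, `densities` and `colors` would be decoded from features interpolated out of the 3D backbone's output volume, so minimizing the photometric loss forces those features to capture the geometry and appearance needed to reproduce the real images.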