PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

Compared with the many foundation models in NLP and 2D computer vision, learning a robust and highly generalizable 3D foundation model is considerably more challenging, primarily because of the inherent variability of 3D data and the diversity of downstream tasks. In this paper, we introduce a comprehensive 3D pre-training framework designed to learn efficient 3D representations, establishing a pathway to 3D foundation models. Motivated by the observation that informative 3D features should encode rich geometry and appearance cues that can be used to render realistic images, we propose a novel universal paradigm that learns point cloud representations through differentiable neural rendering, which serves as a bridge between the 3D and 2D worlds. We train a point cloud encoder inside a purpose-built volumetric neural renderer by comparing the rendered images with the real images. The learned 3D encoder integrates seamlessly into diverse downstream tasks, which include not only high-level tasks such as 3D detection and segmentation but also low-level tasks such as 3D reconstruction and image synthesis, across both indoor and outdoor scenarios. We also show that the proposed universal methodology can pre-train a 2D backbone, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks. The consistent improvements across settings indicate the effectiveness of the proposed method. Code and models will be made available at https://github.com/OpenGVLab/PonderV2.
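
The idea of supervising 3D features by comparing rendered and real images can be illustrated with a minimal sketch. The abstract does not specify the renderer's formulation, so the code below assumes the standard emission-absorption volume rendering quadrature for a single ray and an L1 photometric loss. The function names `composite_ray` and `photometric_l1`, and the per-sample densities and colors (which in a real pipeline would be decoded from the point cloud encoder's features), are hypothetical and not taken from the PonderV2 code.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (assumed emission-absorption model).

    densities: non-negative volume density at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance between consecutive samples
    Returns the rendered pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, weights


def photometric_l1(rendered, target):
    """Mean absolute error between a rendered pixel and the real pixel."""
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(target)


# Hypothetical samples: empty space, then an opaque green surface, then blue.
densities = [0.0, 1000.0, 5.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
deltas = [0.1, 0.1, 0.1]

pixel, weights = composite_ray(densities, colors, deltas)
loss = photometric_l1(pixel, (0.0, 1.0, 0.0))
```

In a differentiable framework, the gradient of such a loss would flow back through the compositing weights to the features that produced the densities and colors, which is what allows image supervision to train a 3D encoder.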
