SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images

The 3D contrastive learning paradigm has demonstrated remarkable performance in downstream tasks through pretraining on point cloud data. Recent advances incorporate additional 2D image priors associated with 3D point clouds for further improvement. Nonetheless, these existing frameworks are constrained by the limited range of available point cloud datasets, primarily due to the high cost of acquiring point cloud data. To this end, we propose SimC3D, a simple but effective 3D contrastive learning framework that, for the first time, pretrains 3D backbones from pure RGB image data. SimC3D performs contrastive 3D pretraining with three appealing properties. (1) Pure image data: SimC3D reduces the reliance on costly 3D point clouds and pretrains 3D backbones using solely RGB images. With depth estimation and suitable data processing, the point cloud synthesized from monocular images shows great potential for 3D pretraining. (2) Simple framework: Traditional multi-modal frameworks facilitate 3D pretraining with 2D priors by utilizing an additional 2D backbone, thereby increasing computational expense. In this paper, we empirically demonstrate that the primary benefit of the 2D modality stems from the incorporation of locality information. Motivated by this observation, SimC3D directly employs 2D positional embeddings as a stronger contrastive objective, eliminating the need for 2D backbones and leading to considerable performance improvements. (3) Strong performance: SimC3D outperforms previous approaches that leverage ground-truth point cloud data for pretraining in various downstream tasks. Furthermore, the performance of SimC3D can be further enhanced by combining multiple image datasets, showcasing its significant potential for scalability. The code will be available at https://github.com/Dongjiahua/SimC3D.
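
The two ideas above (lifting an RGB image to a point cloud via estimated depth, and contrasting per-point 3D features against 2D positional embeddings) can be illustrated with a minimal sketch. It is not the released SimC3D implementation. It assumes a pinhole camera with known intrinsics, a fixed sinusoidal embedding of pixel coordinates, and an InfoNCE loss; the abstract does not specify any of these choices. The function names `backproject_depth`, `sincos_2d_embedding`, and `info_nce` are hypothetical, the depth map stands in for the output of a monocular depth estimator, and random features stand in for the output of a 3D backbone.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud (assumed pinhole model)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")  # rows, columns
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    pixels = np.stack([u, v], axis=-1).reshape(-1, 2)  # source pixel of each point
    return points, pixels

def sincos_2d_embedding(pixels, dim, temperature=10000.0):
    """Fixed sinusoidal embedding of (u, v) pixel coordinates (assumed form)."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    n_freq = dim // 4
    freqs = 1.0 / temperature ** (np.arange(n_freq) / n_freq)
    u = pixels[:, 0:1] * freqs  # (N, n_freq)
    v = pixels[:, 1:2] * freqs
    return np.concatenate([np.sin(u), np.cos(u), np.sin(v), np.cos(v)], axis=1)

def info_nce(features, targets, tau=0.07):
    """InfoNCE loss: row i of `features` is the positive for row i of `targets`."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = f @ t.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Placeholder depth map; in practice this would come from a depth estimator.
depth = np.full((4, 6), 2.0)
points, pixels = backproject_depth(depth, fx=5.0, fy=5.0, cx=3.0, cy=2.0)
targets = sincos_2d_embedding(pixels, dim=16)

# Placeholder per-point features; in practice these come from the 3D backbone.
rng = np.random.default_rng(0)
features = rng.normal(size=(points.shape[0], 16))
loss = info_nce(features, targets)
```

In this sketch the contrastive targets are computed in closed form from pixel coordinates, so no 2D backbone is run. That is the computational saving the abstract attributes to using positional embeddings as the objective.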
