State-of-the-art 3D models excel at recognition tasks but typically depend on large-scale datasets and well-defined category sets. Recent advances in multi-modal pre-training have shown promise for learning 3D representations by aligning features of 3D shapes with those of their 2D RGB or depth counterparts. However, existing frameworks often rely on either RGB or depth images alone, which limits their ability to exploit the full range of multi-modal data for 3D applications. To address this limitation, we present DR-Point, a tri-modal pre-training framework that learns a unified representation of RGB images, depth images, and 3D point clouds by pre-training on object triplets drawn from the three modalities. Because such triplets are scarce, DR-Point employs differentiable rendering to obtain various depth images. This approach not only augments the supply of depth images but also improves the accuracy of reconstructed point clouds, thereby promoting the representation learning of the Transformer backbone. Subsequently, using a limited number of synthetically generated triplets, DR-Point learns a 3D representation space that aligns with the RGB-Depth image space. Our extensive experiments demonstrate that DR-Point outperforms existing self-supervised learning methods on a wide range of downstream tasks, including 3D object classification, part segmentation, point cloud completion, semantic segmentation, and detection. Our ablation studies further validate the effectiveness of DR-Point in enhancing point cloud understanding.
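To make the idea of aligning the three modalities concrete, the following is a minimal sketch of one common way to do it: a symmetric InfoNCE-style contrastive loss applied to each pair of modalities. The abstract does not state which objective DR-Point actually uses, so the loss form, the temperature value, the equal weighting of the three pairs, and the function names `info_nce` and `tri_modal_loss` are all assumptions made for illustration. The embeddings are assumed to come from separate encoders that are not shown here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive match for row i of `b`.

    Assumption: this is a generic contrastive objective, not necessarily
    the loss used by DR-Point.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def tri_modal_loss(point_emb, rgb_emb, depth_emb, temperature=0.07):
    """Average pairwise contrastive loss over (point, RGB, depth) triplets.

    Assumption: all three pairs are weighted equally.
    """
    return (
        info_nce(point_emb, rgb_emb, temperature)
        + info_nce(point_emb, depth_emb, temperature)
        + info_nce(rgb_emb, depth_emb, temperature)
    ) / 3.0


# Toy usage with random stand-in embeddings (8 objects, 16 dimensions).
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = tri_modal_loss(z, z, z)        # every modality agrees per object
shuffled = tri_modal_loss(z, z[::-1], z)  # RGB rows paired with the wrong objects
print(aligned < shuffled)
```

In this toy setup the loss is lower when the three embeddings of each object agree than when the RGB rows are paired with the wrong objects, which is the behavior a contrastive alignment objective is meant to reward.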