Self-supervised representation learning has achieved great success and become prevalent in the language and 2D vision communities over the past few years. However, these advances have not yet fully carried over to 3D point cloud learning. Unlike existing pre-training paradigms for deep point cloud feature extractors, which fall into the scope of generative modeling or contrastive learning, this paper proposes a translative pre-training framework, namely PointVST, driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images. More specifically, we first derive view-conditioned point-wise embeddings by inserting a viewpoint indicator, and then adaptively aggregate them into a view-specific global codeword, which is fed into subsequent 2D convolutional translation heads for image generation. Extensive experimental evaluations on various downstream task scenarios demonstrate that PointVST consistently and markedly outperforms current state-of-the-art approaches and shows satisfactory domain transfer capability. Our code will be publicly available at https://github.com/keeganhk/PointVST.
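The two steps named in the abstract, view conditioning and adaptive aggregation, can be illustrated with a small toy sketch. This is not the PointVST implementation: the abstract does not specify the mechanisms, so the sketch assumes that the viewpoint indicator is inserted by concatenating it to every point feature, and that "adaptive aggregation" is a softmax-weighted pooling over points. All names (`linear`, `init_linear`, `view_specific_codeword`) and dimensions are hypothetical, the weights are random rather than learned, and the 2D convolutional translation heads are omitted.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(rng, out_dim, in_dim):
    """Random (untrained) layer parameters, for illustration only."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def view_specific_codeword(point_feats, view_indicator, fuse, score):
    """Toy version of the two steps described in the abstract.

    point_feats:    N point-wise feature vectors from some backbone (assumed given)
    view_indicator: vector encoding the viewpoint (assumed encoding)
    fuse, score:    (weight, bias) pairs for the two hypothetical layers
    """
    # Step 1 (assumption): insert the viewpoint indicator by concatenating it
    # to each point feature, then fuse to get view-conditioned embeddings.
    conditioned = [
        [math.tanh(h) for h in linear(feat + view_indicator, *fuse)]
        for feat in point_feats
    ]
    # Step 2 (assumption): score each point, normalize the scores over all
    # points, and pool the embeddings into a single global codeword.
    weights = softmax([linear(emb, *score)[0] for emb in conditioned])
    dim = len(conditioned[0])
    codeword = [
        sum(w * emb[d] for w, emb in zip(weights, conditioned)) for d in range(dim)
    ]
    return codeword, weights


rng = random.Random(0)
N, C, V, D = 8, 4, 3, 6  # points, feature dim, indicator dim, codeword dim
point_feats = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(N)]
fuse = init_linear(rng, D, C + V)
score = init_linear(rng, 1, D)

# The same point features yield a different codeword for each viewpoint.
code_a, weights_a = view_specific_codeword(point_feats, [0.0, 0.0, 1.0], fuse, score)
code_b, weights_b = view_specific_codeword(point_feats, [1.0, 0.0, 0.0], fuse, score)
```

In a full pipeline of the kind the abstract describes, a codeword such as `code_a` would be passed to an image decoder and trained against the rendered image for that viewpoint; here the sketch stops at the codeword.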