In perception, information from multiple senses is integrated to map visual information from 2D views onto 3D objects, which aids understanding of 3D environments. However, any single 2D view, whichever angle it is rendered from, provides only limited partial information. The richness of multi-view 2D information can provide superior self-supervised signals for 3D objects. In this paper, we propose MM-Point, a novel self-supervised point cloud representation learning method driven by intra-modal and inter-modal similarity objectives. The core of MM-Point is the simultaneous multi-modal interaction and transmission between a 3D object and multiple 2D views. To enforce the consistent cross-modal objective across multiple 2D views more effectively with contrastive learning, we further propose Multi-MLP and Multi-level Augmentation strategies. Through carefully designed transformation strategies, we further learn multi-level invariance across the 2D views. MM-Point demonstrates state-of-the-art (SOTA) performance in various downstream tasks. For instance, it achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40 and a top accuracy of 87.8% on the real-world dataset ScanObjectNN, comparable to fully supervised methods. Additionally, we demonstrate its effectiveness in tasks such as few-shot classification, 3D part segmentation, and 3D semantic segmentation.
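The inter-modal objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so this assumes a symmetric InfoNCE-style contrastive loss, a common choice for cross-modal contrastive learning, and it stands in a single linear projection per view for each head of the Multi-MLP. All names here (`info_nce`, `multi_view_cross_modal_loss`, `heads`) are hypothetical and not taken from the MM-Point implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce(anchor, positive, temperature=0.07):
    """InfoNCE loss over a batch: row i of `anchor` is paired with row i of
    `positive`; every other row in the batch serves as a negative."""
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    logits = a @ p.T / temperature                      # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def multi_view_cross_modal_loss(point_feats, view_feats, heads, temperature=0.07):
    """Average symmetric point-to-view contrastive loss over M views.

    point_feats: (B, D) embeddings from a point cloud encoder (assumed given).
    view_feats:  (M, B, K) embeddings of M rendered 2D views (assumed given).
    heads:       list of M arrays of shape (D, K); one projection per view,
                 a linear stand-in for a per-view MLP head (assumption).
    """
    losses = []
    for m, w in enumerate(heads):
        z_point = point_feats @ w   # project the 3D features for view m
        z_view = view_feats[m]
        loss_m = 0.5 * (
            info_nce(z_point, z_view, temperature)
            + info_nce(z_view, z_point, temperature)
        )
        losses.append(loss_m)
    return float(np.mean(losses))
```

Using a separate head per view lets the 3D embedding be compared with each 2D view in its own projected space, which is one plausible reading of the Multi-MLP strategy. The actual architecture and loss weighting may differ.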