In perception, multiple kinds of sensory information are integrated to map visual information from 2D views onto 3D objects, which aids understanding of 3D environments. A single 2D view rendered from one particular angle, however, provides only limited, partial information. Multi-view 2D information is richer and can therefore provide superior self-supervised signals for 3D objects. In this paper, we propose MM-Point, a novel self-supervised point cloud representation learning method driven by intra-modal and inter-modal similarity objectives. The core of MM-Point is multi-modal interaction and transmission between a 3D object and multiple 2D views simultaneously. To enforce the contrastive cross-modal consistency objective across the 2D multi-view information more effectively, we further propose Multi-MLP and Multi-level Augmentation strategies. Through carefully designed transformation strategies, we further learn multi-level invariance across the 2D multi-views. MM-Point demonstrates state-of-the-art (SOTA) performance on various downstream tasks. For instance, it achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40 and a top accuracy of 87.8% on the real-world dataset ScanObjectNN, comparable to fully supervised methods. Additionally, we demonstrate its effectiveness on few-shot classification, 3D part segmentation, and 3D semantic segmentation.
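As a rough illustration of what an inter-modal similarity objective over several 2D views can look like, the sketch below computes a symmetric InfoNCE-style contrastive loss between one point cloud embedding and each of its view embeddings, then averages over views. This is a generic contrastive formulation, not the loss defined by MM-Point: the function names, the temperature value, the symmetric form, and the plain averaging over views are all assumptions made for illustration. The same `info_nce` helper could, under the same assumptions, serve as an intra-modal objective by passing embeddings of two augmentations of the same point cloud.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to (approximately) unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(anchor, positive, temperature=0.07):
    """InfoNCE loss for a batch of matched pairs.

    anchor, positive: arrays of shape (B, D). Row i of `anchor` matches
    row i of `positive`; all other rows in the batch act as negatives.
    The temperature is an assumed, commonly used value.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    logits = a @ p.T / temperature                       # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # cross-entropy on the diagonal


def multi_view_cross_modal_loss(point_emb, view_embs, temperature=0.07):
    """Average symmetric InfoNCE between 3D embeddings and each 2D view.

    point_emb: (B, D) embeddings of B point clouds (hypothetical encoder output).
    view_embs: (V, B, D) embeddings of V rendered views per object.
    """
    per_view = [
        0.5 * (info_nce(point_emb, v, temperature) + info_nce(v, point_emb, temperature))
        for v in view_embs
    ]
    return float(np.mean(per_view))
```

A quick sanity check is that view embeddings aligned with their point cloud embeddings should give a lower loss than the same views paired with the wrong objects:

```python
rng = np.random.default_rng(0)
points = rng.normal(size=(8, 16))
views = np.stack([points + 0.05 * rng.normal(size=points.shape) for _ in range(3)])
aligned = multi_view_cross_modal_loss(points, views)
mismatched = multi_view_cross_modal_loss(points, np.roll(views, 1, axis=1))
print(aligned < mismatched)
```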