Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with a fixed mask ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in the MFT are also complementary. We therefore propose a Dual Self-Representation Alignment mechanism at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.
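The dual-ratio masking idea from the abstract can be sketched as follows. This is a minimal toy illustration, not Point-SRA's actual architecture: the "encoder" here is just a centroid stand-in, and the mask ratios (0.60 and 0.90) and alignment loss are illustrative assumptions.

```python
import math
import random

def mask_points(points, ratio, rng):
    """Randomly hide `ratio` of the points; return the visible subset."""
    n_hide = int(len(points) * ratio)
    idx = list(range(len(points)))
    rng.shuffle(idx)
    hidden = set(idx[:n_hide])
    return [p for i, p in enumerate(points) if i not in hidden]

def toy_encode(points):
    """Toy stand-in for a point-cloud encoder: the centroid of visible points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(1024)]

# Two complementary views: a low mask ratio preserves more local geometry,
# a high mask ratio forces the encoder toward global semantic cues.
z_low = toy_encode(mask_points(cloud, 0.60, rng))
z_high = toy_encode(mask_points(cloud, 0.90, rng))

# Self-distillation-style alignment objective: pull the two views together.
align_loss = 1.0 - cosine(z_low, z_high)
print(round(align_loss, 4))
```

In a real MAE pipeline the two branches would share a Transformer encoder and the alignment term would be combined with the reconstruction loss; here the single scalar simply demonstrates that the two differently-masked views yield comparable representations to align.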