Masked Modeling (MM) has proven widely successful across vision tasks by reconstructing masked visual patches. Applying MM to large-scale 3D scenes, however, remains an open problem because of data sparsity and scene complexity. The random masking paradigm conventionally used for 2D images carries a high risk of ambiguity when recovering the masked regions of a 3D scene. To address this, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve representative structured points, thereby strengthening the pretext masking task for 3D scene understanding. Combined with a progressive reconstruction scheme, our method can concentrate on modeling regional geometry while incurring less ambiguity in masked reconstruction. In addition, scenes with progressive masking ratios can be used to self-distill their intrinsic spatial consistency, which requires the model to learn consistent representations from the unmasked areas. Combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas yields a unified framework called MM-3DScene. We conduct comprehensive experiments on a range of downstream tasks. The consistent improvements (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrate the superiority of our approach.
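To make the masking idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the "local statistic" is the absolute difference between a point's scalar feature and the mean feature of its k nearest neighbours, that the highest-scoring points are the ones preserved, and that the remaining points are masked at random. The function names, the `preserve_ratio` parameter, and the ratio schedule are all hypothetical choices made for illustration.

```python
import math
import random


def local_difference(points, features, k=4):
    """Score each point by how far its feature departs from its k nearest neighbours.

    Assumption: this simple statistic stands in for the paper's local statistics.
    """
    scores = []
    for i, p in enumerate(points):
        # Brute-force k-NN; adequate for a toy example only.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        mean = sum(features[j] for j in neighbours) / len(neighbours)
        scores.append(abs(features[i] - mean))
    return scores


def informative_preserved_mask(points, features, mask_ratio,
                               preserve_ratio=0.1, k=4, seed=0):
    """Return (preserved, masked) index sets.

    Points with the largest local difference are never masked; the mask is
    drawn at random from the remaining points.
    """
    n = len(points)
    scores = local_difference(points, features, k)
    n_keep = max(1, round(n * preserve_ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    preserved = set(ranked[:n_keep])

    candidates = [i for i in range(n) if i not in preserved]
    n_mask = min(len(candidates), round(n * mask_ratio))
    masked = set(random.Random(seed).sample(candidates, n_mask))
    return preserved, masked


if __name__ == "__main__":
    # Toy scene: 20 points on a line, one point with a distinctive feature.
    pts = [(float(i), 0.0, 0.0) for i in range(20)]
    feats = [0.0] * 20
    feats[10] = 1.0

    # Progressive masking ratios (hypothetical schedule).
    for ratio in (0.25, 0.5, 0.75):
        kept, hidden = informative_preserved_mask(pts, feats, ratio)
        print(ratio, sorted(kept), len(hidden))
```

In this sketch, a reconstruction loss would be computed on the masked indices, while the preserved points stay visible at every masking ratio and give the model stable structural cues.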