NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Neural fields excel in computer vision and robotics because they can capture the 3D visual world, including its semantics, geometry, and dynamics. Given their ability to densely represent a 3D scene from 2D images, we ask: can we scale their self-supervised pretraining, specifically with masked autoencoders, to generate effective 3D representations from posed RGB images? Motivated by the success of extending transformers to novel data modalities, we adapt standard 3D Vision Transformers to the unique formulation of NeRFs. We use NeRF's volumetric grid as a dense input to the transformer, in contrast to other 3D representations such as point clouds, where information density can be uneven and the representation is irregular. Because masked autoencoders are difficult to apply to an implicit representation such as NeRF, we instead extract an explicit representation that canonicalizes scenes across domains by using the camera trajectory for sampling. We then mask random patches of NeRF's radiance and density grid and train a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model learns the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our curated posed-RGB data, totaling over 1.6 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our self-supervised pretraining method for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Using only unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on the Front3D and ScanNet datasets, with an absolute improvement of over 20% AP50 and 8% AP25 for 3D object detection.
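
The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the grid resolution (16^3), the patch size (4), the 75% mask ratio, and the helper names `patchify`, `random_patch_mask`, and `masked_mse` are all assumptions chosen for illustration, and the Swin Transformer encoder and decoder are replaced by a placeholder prediction. The sketch shows only how a 4-channel radiance-and-density grid might be split into 3D patches, randomly masked, and scored on the masked patches alone.

```python
import numpy as np

def patchify(grid, p):
    """Split a (C, D, H, W) grid into non-overlapping p x p x p patches.

    Returns an array of shape (num_patches, C * p**3).
    """
    c, d, h, w = grid.shape
    assert d % p == 0 and h % p == 0 and w % p == 0, "grid must divide evenly"
    g = grid.reshape(c, d // p, p, h // p, p, w // p, p)
    # Move the three patch indices to the front, then channels and offsets.
    g = g.transpose(1, 3, 5, 0, 2, 4, 6)
    return g.reshape(-1, c * p ** 3)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean vector; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)

# Toy stand-in for a grid sampled from a trained NeRF:
# 3 radiance (RGB) channels + 1 density channel (assumed layout).
grid = rng.random((4, 16, 16, 16), dtype=np.float32)

patches = patchify(grid, p=4)                     # shape (64, 256)
mask = random_patch_mask(len(patches), 0.75, rng)  # assumed 75% mask ratio
visible = patches[~mask]                          # what an encoder would see

# Placeholder for the decoder output; a real model would predict this.
pred = np.zeros_like(patches)
loss = masked_mse(pred, patches, mask)
```

Because a regular grid divides evenly into equal-sized patches, the patching reduces to a reshape and transpose, which is one practical reading of why a dense volumetric input is convenient compared with an irregular point cloud.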
