This work proposes NeRF-Supervised Masked AutoEncoder (NS-MAE), a unified self-supervised pre-training framework that learns transferable multi-modal perception representations via masked multi-modal reconstruction in a Neural Radiance Field (NeRF). Specifically, multi-modal embeddings are extracted from corrupted multi-modal input signals, i.e., LiDAR point clouds and images, and, conditioned on given view directions and locations, are rendered into projected multi-modal feature maps via neural rendering. The original multi-modal signals then serve as reconstruction targets for the rendered feature maps, enabling self-supervised representation learning. Extensive experiments show that the representation learned via NS-MAE has promising transferability to diverse multi-modal and single-modal (camera-only and LiDAR-only) perception models, across 3D perception downstream tasks (3D object detection and BEV map segmentation) and across varying amounts of labeled fine-tuning data. Moreover, we empirically find that NS-MAE benefits from the synergy between the masked autoencoder and neural radiance field mechanisms. We hope this study inspires further exploration of more general multi-modal representation learning for autonomous agents.
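The pipeline described above (corrupt the inputs, render along rays, compare against the original signals) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the zero-fill masking, and the loss form (mean squared error on color plus L1 on depth) are assumptions made for illustration, and the encoder that would produce densities and features from the masked inputs is omitted. Only the ray quadrature follows the standard NeRF volume-rendering formulation.

```python
import numpy as np


def random_mask(x, mask_ratio, rng):
    """Zero out a random subset of rows (image patches or LiDAR points).

    Assumption: masking by zero-fill; returns the corrupted copy and a
    boolean mask that is True for rows that were kept.
    """
    n = x.shape[0]
    n_masked = int(round(n * mask_ratio))
    keep = np.ones(n, dtype=bool)
    keep[rng.choice(n, size=n_masked, replace=False)] = False
    corrupted = x.copy()
    corrupted[~keep] = 0.0
    return corrupted, keep


def volume_render(sigma, feats, t):
    """Standard NeRF quadrature along each ray.

    sigma: (R, S) non-negative densities at S samples on R rays
    feats: (R, S, C) per-sample features (e.g. color)
    t:     (R, S) sample depths, ascending along each ray
    """
    delta = np.diff(t, axis=-1)
    # Treat the last interval as effectively infinite.
    delta = np.concatenate([delta, np.full_like(t[..., :1], 1e10)], axis=-1)
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j), with T_0 = 1.
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    weights = trans * alpha
    rendered = (weights[..., None] * feats).sum(axis=-2)  # (R, C)
    depth = (weights * t).sum(axis=-1)                    # (R,)
    return rendered, depth, weights


def reconstruction_loss(rendered_rgb, rendered_depth, target_rgb, target_depth):
    """Hypothetical objective: MSE on color plus L1 on depth."""
    rgb_term = np.mean((rendered_rgb - target_rgb) ** 2)
    depth_term = np.mean(np.abs(rendered_depth - target_depth))
    return rgb_term + depth_term
```

In a full pre-training loop, `sigma` and `feats` would be predicted by the perception encoder from the masked inputs, the target color would come from the original image pixels, and the target depth from the projected LiDAR points, so that minimizing the loss trains the encoder without labels.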