In autonomous driving, the importance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have been widely successful, most follow ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm that applies 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method allows seamless integration into both 2D and 3D frameworks, enabling a more holistic comprehension of scenes. We demonstrate the feasibility and effectiveness of UniPAD through extensive experiments on various downstream 3D tasks. Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic segmentation on the nuScenes validation set, which are state-of-the-art results compared with previous methods. The code will be available at https://github.com/Nightmare-n/UniPAD.