This paper shows the effectiveness of 2D backbone scaling and pretraining for pillar-based 3D object detectors. Pillar-based methods mainly employ randomly initialized 2D convolutional neural networks (ConvNets) for feature extraction and therefore fail to benefit from backbone scaling and pretraining in the image domain. To show the scaling-up capacity in point clouds, we introduce dense ConvNets pretrained on large-scale image datasets (e.g., ImageNet) as the 2D backbone of pillar-based detectors. The ConvNets are adaptively designed for each model size according to the specific features of point clouds, such as sparsity and irregularity. Equipped with the pretrained ConvNets, our proposed pillar-based detector, termed PillarNeSt, outperforms existing 3D object detectors by a large margin on the nuScenes and Argoverse 2 datasets. Our code will be released upon acceptance.
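For readers unfamiliar with the pillar representation, the following is a minimal sketch of the step that precedes the 2D backbone: points are binned into vertical columns (pillars) on a bird's-eye-view grid, which yields a dense 2D map that a 2D ConvNet can consume. It is not the PillarNeSt implementation. The function names `pillarize` and `occupancy`, the grid ranges, and the use of a per-pillar point count as the only feature are illustrative assumptions; real detectors learn a feature vector per pillar.

```python
from collections import defaultdict


def pillarize(points, x_range, y_range, pillar_size):
    """Bin (x, y, z) points into BEV pillars.

    Returns a dense ny-by-nx grid of per-pillar point counts and a dict
    mapping (row, col) to the points that fell into that pillar.
    All names and the count feature are hypothetical, for illustration only.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    nx = int(round((x_max - x_min) / pillar_size))
    ny = int(round((y_max - y_min) / pillar_size))

    pillars = defaultdict(list)
    for x, y, z in points:
        # Drop points outside the detection range.
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        # Clamp to guard against floating-point edge cases at the border.
        ix = min(int((x - x_min) / pillar_size), nx - 1)
        iy = min(int((y - y_min) / pillar_size), ny - 1)
        pillars[(iy, ix)].append((x, y, z))

    # Dense pseudo-image: empty pillars stay at zero.
    grid = [[0] * nx for _ in range(ny)]
    for (iy, ix), pts in pillars.items():
        grid[iy][ix] = len(pts)
    return grid, dict(pillars)


def occupancy(grid):
    """Fraction of grid cells that contain at least one point."""
    total = sum(len(row) for row in grid)
    filled = sum(1 for row in grid for count in row if count > 0)
    return filled / total


points = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.5), (3.5, 3.5, 1.0), (9.0, 0.0, 0.0)]
grid, pillars = pillarize(points, (0.0, 4.0), (0.0, 4.0), 1.0)
print(occupancy(grid))  # 0.125
```

In this toy input only 2 of 16 cells are occupied, which illustrates the sparsity that the abstract cites as a reason to adapt image ConvNets rather than reuse them unchanged.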