Label-efficient LiDAR-based 3D object detection is currently dominated by weakly supervised and semi-supervised methods. Rather than following either exclusively, we propose MixSup, a more practical paradigm that simultaneously uses massive cheap coarse labels and a limited number of accurate labels for Mixed-grained Supervision. We start by observing that point clouds are usually textureless, which makes semantics hard to learn. However, point clouds are geometrically rich and scale-invariant with respect to distance from the sensor, which makes it relatively easy to learn object geometry, such as poses and shapes. Thus, MixSup leverages massive coarse cluster-level labels to learn semantics and a few expensive box-level labels to learn accurate poses and shapes. We redesign the label assignment in mainstream detectors, allowing them to be seamlessly integrated into MixSup and making the paradigm practical and universal. We validate its effectiveness on nuScenes, Waymo Open Dataset, and KITTI with various detectors. MixSup achieves up to 97.31% of fully supervised performance using cheap cluster annotations and only 10% of the box annotations. Furthermore, we propose PointSAM, based on the Segment Anything Model, for automated coarse labeling, further reducing the annotation burden. The code is available at https://github.com/BraveGroup/PointSAM-for-MixSup.
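The division of labor described above (coarse cluster-level labels for semantics, a few box-level labels for geometry) can be illustrated with a minimal sketch. This is only one possible reading of the idea and not the released MixSup implementation: the function name `mixed_grained_loss`, its arguments, and the simple mean-based weighting are all assumptions made for illustration.

```python
import numpy as np

def mixed_grained_loss(cls_loss, reg_loss, has_box):
    """Combine per-object losses under mixed-grained supervision (illustrative only).

    cls_loss: (N,) classification loss for every labeled object (cluster or box).
    reg_loss: (N,) box regression loss per object; ignored where has_box is False.
    has_box:  (N,) bool mask, True for the few objects that carry box-level labels.
    """
    cls_loss = np.asarray(cls_loss, dtype=float)
    reg_loss = np.asarray(reg_loss, dtype=float)
    has_box = np.asarray(has_box, dtype=bool)

    # Semantics: every labeled object contributes, whether coarse or accurate.
    cls_term = cls_loss.mean() if cls_loss.size else 0.0

    # Geometry: only box-labeled objects contribute to pose/shape regression.
    n_box = int(has_box.sum())
    reg_term = reg_loss[has_box].sum() / n_box if n_box > 0 else 0.0

    return float(cls_term + reg_term)

# Hypothetical batch: four labeled objects, only the first has a box label.
example = mixed_grained_loss(
    cls_loss=[1.0, 2.0, 3.0, 4.0],
    reg_loss=[10.0, 20.0, 30.0, 40.0],
    has_box=[True, False, False, False],
)
```

In this sketch, objects with only a cluster-level label never influence the regression term, so inaccurate coarse geometry cannot degrade the learned poses and shapes.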