Unsupervised 3D object detection methods have emerged to exploit large amounts of data without requiring manual labels for training. Recent approaches learn to detect objects from dynamic objects only and, as a consequence, penalize detections of static instances during training. To compensate, they use multiple rounds of (self-)training in which detected static instances are added to the set of training targets, and this procedure is computationally expensive. To address this, we propose UNION. We use spatial clustering and self-supervised scene flow to obtain a set of static and dynamic object proposals from LiDAR. We then encode the visual appearance of each proposal and separate static foreground objects from static background by selecting the static instances that are visually similar to dynamic objects. As a result, static and dynamic foreground objects are obtained together, and existing detectors can be trained in a single training run. In addition, we extend 3D object discovery to detection by using appearance-based cluster labels as pseudo-class labels for training object classification. We conduct extensive experiments on the nuScenes dataset and improve on the state of the art for unsupervised object discovery: UNION more than doubles the average precision, to 33.9. The code will be made publicly available.
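The appearance-based selection step described above can be illustrated with a minimal sketch. It is not the released implementation: it assumes that every proposal already has an appearance embedding (here a plain NumPy array standing in for the output of an image encoder) and a dynamic/static flag derived from scene flow, and it swaps in a simple k-means with farthest-first initialization as the clustering method. The function names `kmeans` and `select_foreground`, the parameter `min_dynamic_fraction`, and its default value are hypothetical choices made for illustration only.

```python
import numpy as np


def kmeans(features, k, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-first initialization.

    Returns one cluster index per row of `features`.
    """
    features = np.asarray(features, dtype=float)
    # Farthest-first initialization: start from the first point, then
    # repeatedly pick the point farthest from all centers chosen so far.
    centers = [features[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [((features - c) ** 2).sum(axis=1) for c in centers], axis=0
        )
        centers.append(features[dist_to_nearest.argmax()])
    centers = np.array(centers, dtype=float)

    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every feature to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels


def select_foreground(features, is_dynamic, k=2, min_dynamic_fraction=0.05):
    """Mark proposals as foreground if their appearance cluster contains
    enough dynamic proposals (assumed selection rule, for illustration).

    Returns (foreground_mask, cluster_labels). The cluster labels of the
    selected proposals could then serve as pseudo-class labels.
    """
    is_dynamic = np.asarray(is_dynamic, dtype=bool)
    labels = kmeans(features, k)
    foreground = np.zeros(len(labels), dtype=bool)
    for c in range(k):
        in_cluster = labels == c
        if in_cluster.any() and is_dynamic[in_cluster].mean() >= min_dynamic_fraction:
            # Static proposals in this cluster look like dynamic objects,
            # so the whole cluster is kept as foreground.
            foreground |= in_cluster
    return foreground, labels
```

In this sketch, a static proposal is kept only because it shares an appearance cluster with at least some dynamic proposals, which mirrors the idea of selecting static instances that are visually similar to dynamic objects. The number of clusters and the threshold would have to be tuned on real data.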