Embodied agents must detect and localize objects of interest, e.g., traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised object segmentation, but in the absence of annotated boxes, it is unclear how pixels should be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired both by the human visual system and by practical applications, we posit that the key missing cue is motion: objects of interest are typically mobile objects. We propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with pseudo-labels derived from motion segmentation, but introduce a novel training paradigm to progressively discover small objects and static-but-mobile objects that are missed by motion segmentation. As a result, though learned only from unlabeled videos, MOD-UV can detect and segment mobile objects from a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on the Waymo Open, nuScenes, and KITTI datasets without using any external data or supervised models. Code is publicly available at https://github.com/YihongSun/MOD-UV.