Embodied agents must detect and localize objects of interest, e.g. traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised instance detection and segmentation, but in the absence of annotated boxes, it is unclear how pixels should be grouped into objects and which objects are of interest. This results in over- or under-segmentation and in the detection of irrelevant objects. Inspired by the human visual system and by practical applications, we posit that the key missing cue for unsupervised detection is motion: objects of interest are typically mobile objects that move frequently, and their motion can distinguish separate instances. In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with instance pseudo-labels derived from motion segmentation, and introduce a novel training paradigm that progressively discovers the small objects and static-but-mobile objects that motion segmentation misses. As a result, although it is learned only from unlabeled videos, MOD-UV can detect and segment mobile objects in a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on the Waymo Open, nuScenes, and KITTI datasets without using any external data or supervised models. Code is available at https://github.com/YihongSun/MOD-UV.