The human visual perception system demonstrates exceptional capabilities in learning without explicit supervision and understanding the part-to-whole composition of objects. Drawing inspiration from these two abilities, we propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a novel approach that learns to detect objects and understand their compositions without human supervision. HASSOD employs a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations, adaptively determining the number of objects per image. Furthermore, HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures. This additional self-supervised learning task leads to improved detection performance and enhanced interpretability. Lastly, we abandon the inefficient multi-round self-training process utilized in prior methods and instead adapt the Mean Teacher framework from semi-supervised learning, which leads to a smoother and more efficient training process. Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B. Project page: https://HASSOD-NeurIPS23.github.io.
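As an illustration of the coverage analysis described above, the following minimal sketch builds a tree over a set of masks. It is not the HASSOD implementation: the names `coverage`, `build_tree`, `levels`, and `rect`, the pixel-set mask representation, and the threshold `tau = 0.9` are all assumptions made for this example. A mask is attached to the smallest larger mask that covers at least a fraction `tau` of its pixels, and its hierarchical level is taken to be its depth in the resulting tree.

```python
from typing import FrozenSet, List, Optional, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # hypothetical representation: a mask is a set of (row, col) pixels


def rect(r0: int, r1: int, c0: int, c1: int) -> Mask:
    """Toy helper: a rectangular mask covering rows [r0, r1) and columns [c0, c1)."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


def coverage(inner: Mask, outer: Mask) -> float:
    """Fraction of `inner`'s pixels that also lie in `outer`."""
    return len(inner & outer) / len(inner) if inner else 0.0


def build_tree(masks: List[Mask], tau: float = 0.9) -> List[Optional[int]]:
    """Return parent[i]: index of the smallest larger mask covering mask i, or None.

    `tau` is an assumed coverage threshold, not a value taken from the paper.
    """
    parents: List[Optional[int]] = []
    for i, m in enumerate(masks):
        candidates = [
            j for j, other in enumerate(masks)
            if j != i and len(other) > len(m) and coverage(m, other) >= tau
        ]
        # Parents are strictly larger than their children, so no cycles can form.
        parents.append(
            min(candidates, key=lambda j: len(masks[j])) if candidates else None
        )
    return parents


def levels(parents: List[Optional[int]]) -> List[int]:
    """Depth of each mask in its tree (0 for a root mask)."""
    def depth(i: int) -> int:
        d = 0
        while parents[i] is not None:
            i = parents[i]
            d += 1
        return d
    return [depth(i) for i in range(len(parents))]


# Toy example: a large mask, a mask inside it, a mask inside that, and a separate mask.
masks = [rect(0, 10, 0, 4), rect(0, 3, 0, 4), rect(1, 2, 1, 2), rect(0, 5, 6, 10)]
parents = build_tree(masks)
print(parents)          # [None, 0, 1, None]
print(levels(parents))  # [0, 1, 2, 0]
```

In this toy case the first and last masks are roots, while the second and third masks sit at depths 1 and 2 beneath the first. A real pipeline would operate on binary mask arrays and would need to handle noisy, partially overlapping masks.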