As the most fundamental tasks of computer vision, object detection and segmentation have made tremendous progress in the deep learning era. Because manual labeling is expensive, the annotated categories in existing datasets are often small in scale and pre-defined, so state-of-the-art detectors and segmentors fail to generalize beyond the closed vocabulary. To address this limitation, the last few years have seen increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we provide a comprehensive review of past and recent developments in OVD and OVS. To this end, we develop a taxonomy according to the type of task and methodology. We find that whether and how weak supervision signals are permitted and used discriminates well between methodologies, which include visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D scene and video understanding. For each category, we thoroughly discuss its main principles, key challenges, development routes, strengths, and weaknesses. In addition, we benchmark each task along with the vital components of each method. Finally, we outline several promising directions to stimulate future research.