A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Because manual labeling is expensive, the annotated categories in existing datasets are often small-scale and pre-defined, so state-of-the-art fully supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, the community has in the last few years paid increasing attention to Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. We first develop a taxonomy to organize the different tasks and methodologies. We find that whether weak supervision signals are permitted, and how they are used, discriminates well between the methodologies, which include visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D and video understanding. We thoroughly analyze the main design principles, key challenges, development routes, and the strengths and weaknesses of each methodology. In addition, we benchmark each task, along with the vital components of each method, in the appendix; these benchmarks are kept up to date online at https://github.com/seanzhuh/awesome-open-vocabulary-detection-and-segmentation. Finally, we present and discuss several promising directions to stimulate future research.
