Large-scale vision-language models (e.g., CLIP) have been leveraged by various methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the zero-shot detection setting. In contrast, distillation-based methods require no extra data, but they have their own limitations. Specifically, existing work creates distillation regions that are biased toward the base categories, which limits the distillation of novel-category information and harms distillation efficiency. Furthermore, directly using the raw CLIP features for distillation neglects the domain gap between CLIP's training data and the detection datasets, which makes it difficult to learn the mapping from image regions to the vision-language feature space, an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). First, EZSD adapts CLIP's feature space to the target detection domain by re-normalizing CLIP, which bridges the domain gap. Second, EZSD uses CLIP to generate distillation proposals that potentially contain novel instances, so that the distillation is not overly biased toward the base categories. Finally, EZSD takes advantage of semantic meaning for regression to further improve model performance. As a result, EZSD achieves state-of-the-art performance on the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in the LVIS overall setting with 1/10 of the training time.
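To make the general idea of feature distillation into a vision-language space concrete, the following is a minimal NumPy sketch, not the EZSD implementation. It assumes an L1 loss between L2-normalized student (detector) and teacher (CLIP) region features, and it classifies regions by cosine similarity to category text embeddings. The abstract specifies neither the loss form nor the temperature value, and the function names `distillation_loss` and `classify_regions` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def distillation_loss(student_feats, teacher_feats):
    """Mean L1 distance between normalized features.

    Assumption: the L1 form is illustrative; the abstract does not
    state which distillation loss EZSD uses.
    """
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return float(np.mean(np.abs(s - t)))


def classify_regions(region_feats, text_embeds, temperature=0.01):
    """Assign each region to the category with the most similar text embedding.

    region_feats: (num_regions, dim) features in the shared space.
    text_embeds:  (num_categories, dim) category text embeddings.
    temperature:  assumed softmax temperature (hypothetical value).
    Returns (labels, probs).
    """
    sims = l2_normalize(region_feats) @ l2_normalize(text_embeds).T
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

Because classification depends only on similarity to text embeddings, a novel category can be added at inference time by appending its text embedding to `text_embeds`. This is why a well-learned mapping from image regions to the vision-language space matters for detecting unseen objects.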