Object detection in aerial images is a pivotal task for many earth observation applications. However, current algorithms detect only a pre-defined set of object categories, require sufficient bounding-box-annotated training samples, and fail to detect novel object categories. In this paper, we consider open-vocabulary object detection (OVD) in aerial images, which enables the characterization of new objects on the earth's surface beyond the training categories without annotating training images for these new categories. The performance of OVD depends on the quality of class-agnostic region proposals and pseudo-labels, both of which must generalize well to novel object categories. To generate high-quality proposals and pseudo-labels simultaneously, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework employs the CLIP model as an extra omniscient teacher that injects rich knowledge into the student-teacher self-learning process. By doing so, our approach improves both the proposal and the classification of novel objects. Furthermore, we design a dynamic label queue technique to maintain high-quality pseudo-labels during batch training and to mitigate label imbalance. We conduct extensive experiments on multiple existing aerial object detection datasets, which we set up for the OVD task. Experimental results show that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 40.0 HM (harmonic mean), which outperforms the previous methods Detic/ViLD by 26.9/21.1 on the VisDroneZSD dataset.
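
For readers unfamiliar with the HM metric, the following is a minimal sketch of a harmonic mean of two scores. It assumes, as is common in zero-shot detection benchmarks but not stated in this abstract, that HM combines the detection accuracy on base (seen) and novel (unseen) categories. The function name `harmonic_mean` and the input values are illustrative and are not results from the paper.

```python
def harmonic_mean(base_score: float, novel_score: float) -> float:
    """Harmonic mean of two non-negative scores; 0.0 if either score is zero."""
    if base_score < 0 or novel_score < 0:
        raise ValueError("scores must be non-negative")
    total = base_score + novel_score
    if total == 0:
        return 0.0
    return 2.0 * base_score * novel_score / total


# Illustrative values only (not taken from the paper).
balanced = harmonic_mean(50.0, 50.0)  # equal scores leave the value unchanged
skewed = harmonic_mean(80.0, 20.0)    # pulled toward the weaker score
```

Unlike an arithmetic mean, the harmonic mean is dominated by the weaker of the two scores, so a detector cannot reach a high HM by performing well on base categories alone.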