The goal of this paper is to perform object detection in satellite imagery with only a few examples, thus enabling users to specify any object class with minimal annotation. To this end, we explore recent methods and ideas from open-vocabulary detection for the remote sensing domain. We develop a few-shot object detector based on a traditional two-stage architecture, where the classification block is replaced by a prototype-based classifier. A large-scale pre-trained model is used to build class-reference embeddings, or prototypes, which are compared to the contents of region proposals for label prediction. In addition, we propose to fine-tune the prototypes on available training images to boost performance and learn differences between similar classes, such as aircraft types. We perform extensive evaluations on two remote sensing datasets containing challenging and rare objects. Moreover, we study the performance of both visual and image-text features, namely DINOv2 and CLIP, including two CLIP models specifically tailored for remote sensing applications. Results indicate that visual features largely outperform vision-language features, as the latter lack the necessary domain-specific vocabulary. Lastly, the developed detector outperforms fully supervised and few-shot methods evaluated on the SIMD and DIOR datasets, despite having very few trainable parameters.
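The prototype-based classification step can be illustrated with a minimal sketch. It assumes that each prototype is the mean of L2-normalized support features and that proposals are scored by cosine similarity; the abstract does not specify either choice, and the fine-tuning of prototypes is not covered here. The function names, the two-dimensional toy features, and the class labels `airliner` and `fighter` are hypothetical; in practice the features would come from a pre-trained backbone such as DINOv2 or CLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def build_prototypes(support_feats, support_labels):
    """Average the normalized support features of each class into one prototype.

    Assumption: mean pooling followed by re-normalization; other
    aggregation schemes are possible.
    """
    grouped = {}
    for feat, label in zip(support_feats, support_labels):
        grouped.setdefault(label, []).append(l2_normalize(feat))
    prototypes = {}
    for label, feats in grouped.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def classify(proposal_feat, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = l2_normalize(proposal_feat)
    scores = {
        label: sum(q * p for q, p in zip(query, proto))
        for label, proto in prototypes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy support set: two examples per (hypothetical) class.
support_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.2, 0.9]]
support_labels = ["airliner", "airliner", "fighter", "fighter"]
prototypes = build_prototypes(support_feats, support_labels)

# Features of two region proposals (toy values).
label_a, _ = classify([0.8, 0.2], prototypes)
label_b, _ = classify([0.1, 0.7], prototypes)
print(label_a, label_b)
```

In this sketch only the prototypes carry class information, so adding a new class amounts to adding one vector per class, which is consistent with the small number of trainable parameters mentioned in the abstract.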