This paper addresses cross-domain few-shot object detection (CD-FSOD), which aims to develop an accurate object detector for novel domains from minimal labeled examples. Transformer-based open-set detectors, e.g., DE-ViT~\cite{zhang2023detect}, have excelled in both open-vocabulary object detection and traditional few-shot object detection, detecting categories beyond those seen during training. This naturally raises two key questions: 1) Can such open-set detection methods easily generalize to CD-FSOD? 2) If not, how can the results of open-set methods be improved when they face significant domain gaps? To address the first question, we introduce several metrics to quantify domain variances and establish a new CD-FSOD benchmark covering diverse values of these metrics. We evaluate several state-of-the-art (SOTA) open-set object detection methods on this benchmark and observe evident performance degradation across out-of-domain datasets, which indicates that open-set detectors fail when adopted directly for CD-FSOD. Subsequently, to overcome this performance degradation and answer the second question, we enhance the vanilla DE-ViT. With several novel components, including finetuning, a learnable prototype module, and a lightweight attention module, we present an improved Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO). Experiments show that CD-ViTO achieves impressive results on both out-of-domain and in-domain target datasets, establishing new SOTA results for both CD-FSOD and FSOD. All datasets, code, and models will be released to the community.