In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by the UPN often suffer from over-fragmentation: they cover only partial object regions, producing numerous small false-positive proposals rather than accurate, complete detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, and graph diffusion operations propagate confidence scores across the graph. This reweighting process refines proposal scores, assigning higher confidence to whole objects and lower confidence to local, fragmented parts, which improves detection granularity and effectively reduces false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD benchmark, which spans multiple datasets and domains, FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: https://intellindust-ai-lab.github.io/projects/FSOD-VFM.
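The graph-based confidence reweighting described above can be illustrated with a minimal sketch. This is not the paper's implementation; the containment threshold `tau`, the diffusion weight `alpha`, and the edge rule (a fragment box points to any box that mostly contains it) are all assumptions chosen for illustration. The idea is that fragment proposals forward part of their confidence along directed edges to their containing boxes, so whole-object boxes accumulate score while isolated fragments are relatively downweighted.

```python
import numpy as np

def containment(box_a, box_b):
    """Fraction of box_a's area covered by box_b (boxes as [x1, y1, x2, y2])."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return inter / area_a if area_a > 0 else 0.0

def reweight_scores(boxes, scores, tau=0.8, alpha=0.5, steps=10):
    """Sketch of graph diffusion: fragments pass confidence to containers.

    tau:   containment ratio above which a directed edge is created (assumed)
    alpha: diffusion strength in the iterative update (assumed)
    """
    n = len(boxes)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # directed edge: fragment i -> box j that mostly contains it
            if i != j and containment(boxes[i], boxes[j]) >= tau:
                A[i, j] = 1.0
    # row-normalize so each fragment distributes a fixed amount of score
    row = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row, out=np.zeros_like(A), where=row > 0)
    s0 = np.asarray(scores, dtype=float)
    s = s0.copy()
    for _ in range(steps):
        # each box keeps part of its base score and receives diffused score
        s = (1 - alpha) * s0 + alpha * (A.T @ s)
    return s
```

As a usage example, a large box covering two small boxes inside it ends up with a higher score than either fragment after diffusion, even if its initial confidence was lower, which is the behavior the reweighting is meant to produce.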