Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains such as Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, which limits critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" because their embeddings are similar and often inseparable. The resulting irrelevant detections can hamper specific user goals, such as monitoring illegal fishing. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real time on only a handful of user-annotated examples, drastically reducing the high cost of annotating RS imagery. The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation in less than a minute, which is significantly faster than state-of-the-art alternatives. Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
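The abstract does not specify how FLAME's sampling step is implemented, so the following is only a minimal sketch of the general idea (density estimation to find marginal candidates, then clustering for diversity), not the authors' method. It assumes that each proposal carries a scalar zero-shot score and an embedding vector, that the decision boundary can be approximated by the lowest-density score between the two modes of a Gaussian kernel density estimate, and that plain k-means is an acceptable stand-in for the clustering step. All function names, the bandwidth, and the band width are hypothetical.

```python
import math
import random


def kde(values, bandwidth):
    """Return a callable Gaussian kernel density estimate over 1-D values."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return lambda x: norm * sum(
        math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
    )


def find_boundary(scores, bandwidth=0.05, grid_size=201):
    """Assumed boundary: the lowest-density score between the two density peaks."""
    density = kde(scores, bandwidth)
    lo, hi = min(scores), max(scores)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [density(x) for x in grid]
    mid = grid_size // 2
    left_peak = max(range(mid), key=dens.__getitem__)
    right_peak = max(range(mid, grid_size), key=dens.__getitem__)
    valley = min(range(left_peak, right_peak + 1), key=dens.__getitem__)
    return grid[valley]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def select_for_annotation(scores, embeddings, budget, band=0.1):
    """Pick up to `budget` diverse proposals whose score is near the boundary."""
    boundary = find_boundary(scores)
    marginal = [i for i, s in enumerate(scores) if abs(s - boundary) <= band]
    if len(marginal) <= budget:
        return marginal
    centroids = kmeans([embeddings[i] for i in marginal], budget)
    chosen = []
    for c in centroids:
        # Nearest not-yet-chosen marginal sample to this centroid.
        best = min(
            (i for i in marginal if i not in chosen),
            key=lambda i: sq_dist(embeddings[i], c),
        )
        chosen.append(best)
    return chosen


# Synthetic demo: two score modes (near 0.2 and 0.8) that thin out toward 0.5.
n = 40
scores = [0.2 + 0.3 * (i / n) ** 2 for i in range(n)]
scores += [0.8 - 0.3 * (i / n) ** 2 for i in range(n)]
embeddings = [[s, (i % 5) / 5.0] for i, s in enumerate(scores)]  # toy 2-D features
boundary = find_boundary(scores)
chosen = select_for_annotation(scores, embeddings, budget=4)
print(boundary, chosen)
```

In this sketch the selected indices would be shown to the user for labeling, and the resulting handful of labels would train the lightweight classifier that filters the zero-shot proposals. Choosing one sample per cluster is one simple way to avoid spending the annotation budget on near-duplicate proposals.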