Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
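The prompting scheme described above can be illustrated with a minimal sketch. The function names, the mean-pooling of exemplar embeddings, and the convex-combination fusion below are illustrative assumptions, not the actual RS-MPOD modules: the paper's visual prompt encoder and multimodal fusion module are learned networks, whereas this sketch only shows the underlying idea of turning exemplar instances into an appearance-based category prototype, optionally fusing it with a text embedding, and classifying candidate regions by cosine similarity.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length for cosine-similarity comparison."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def visual_prototype(exemplar_embs):
    """Appearance-based category cue: mean of exemplar instance embeddings.

    Stands in for the learned visual prompt encoder (illustrative only).
    """
    return l2_normalize(np.mean(exemplar_embs, axis=0))


def fuse_prompts(vis_proto, text_emb, alpha=0.5):
    """Hypothetical late fusion: convex combination of the two modalities.

    The real multimodal fusion module is learned; alpha here is a toy knob.
    """
    return l2_normalize(alpha * vis_proto + (1.0 - alpha) * l2_normalize(text_emb))


def classify_regions(region_embs, prototypes):
    """Assign each candidate region to the most similar category prototype."""
    sims = l2_normalize(region_embs) @ np.stack(prototypes).T
    return sims.argmax(axis=1)
```

With text-only prompting, the prototype would be just the text embedding; with visual prompting, `visual_prototype` alone specifies the category, which is why ambiguous category names no longer matter; `fuse_prompts` covers the multimodal case when both are available.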