We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector on referring-expression data, framing this as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. A task-aware architecture then processes this graph to perform VG. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring-object localization. Our model achieves superior performance on the OPT-RSVG and DIOR-RSVG datasets, with significant improvements over state-of-the-art methods, while retaining classical OD capabilities. The code will be available in our repository: \url{https://github.com/rd20karim/MB-ORES}.
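To make the final localization step concrete, the sketch below illustrates one plausible reading of the soft selection mechanism: the reasoning network scores each proposal, the scores are normalized with a softmax, and the referred box is recovered as a probability-weighted combination of the proposal boxes. This is an illustrative assumption about the mechanism, not the authors' implementation; the function name `soft_select` and the box layout are hypothetical.

```python
import numpy as np

def soft_select(proposal_boxes, logits):
    """Softmax over proposal scores, then a probability-weighted
    combination of proposal boxes (a soft, differentiable selection
    rather than a hard argmax). Boxes are assumed (x, y, w, h)."""
    z = logits - logits.max()          # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    # Weighted average of boxes yields the final referred-object box
    return probs @ proposal_boxes, probs

# Three candidate proposals and their reasoning-network scores
boxes = np.array([[10., 10., 20., 20.],
                  [50., 40., 30., 25.],
                  [12., 11., 22., 18.]])
scores = np.array([2.0, 0.1, 1.5])
box, p = soft_select(boxes, scores)    # p concentrates on proposal 0
```

Because the selection is a soft weighting rather than a hard argmax, gradients can flow to all proposal scores during training, while at inference the distribution typically concentrates on the referred object.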