Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms within a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection through several highly efficient modules. OTA-Det achieves state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.
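To illustrate the general idea behind multi-granularity region–text alignment, the following is a minimal sketch, not the paper's actual implementation: region query features are matched against text embeddings for both the holistic expression and its individual attribute phrases via cosine similarity. All names (`dense_alignment_scores`, the feature dimensions, the toy inputs) are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def dense_alignment_scores(region_feats, text_feats):
    """Cosine-similarity alignment between N region queries and M text
    embeddings (one holistic expression plus attribute phrases).

    region_feats: (N, D) array of region query features.
    text_feats:   (M, D) array of text embeddings at mixed granularities.
    Returns an (N, M) matrix of similarity logits in [-1, 1].
    """
    r = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    return r @ t.T

# Toy example: 4 region queries aligned against 3 text embeddings
# (1 holistic expression + 2 attribute phrases), D = 256.
rng = np.random.default_rng(0)
regions = rng.standard_normal((4, 256))
texts = rng.standard_normal((3, 256))
scores = dense_alignment_scores(regions, texts)
print(scores.shape)  # (4, 3)
```

In practice, a matrix like `scores` would be supervised with a contrastive or matching loss so that each region's similarity peaks at its corresponding expression and attribute entries.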