Referring detection aims to locate the target described by a natural-language expression, and it has recently attracted growing research interest. However, existing datasets are limited to ground-level images in which large objects are centered in relatively small scenes. This paper introduces RefAerial, a large-scale, challenging dataset for referring detection in aerial images. It differs from conventional ground referring detection datasets in four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated annotation of referring pairs. Moreover, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset because of the intrinsic scale variation within or across aerial images. We therefore propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy: the MoG attention is developed for scale-comprehensive target understanding, while the CtS decoding strategy is designed for coarse-to-fine decoding of the referring target. The proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and also yields a promising performance boost on conventional ground referring detection datasets.