Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which limits their capacity to learn rich feature representations and hinders the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion; and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction at minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy incorporates spatially rich shallow features during fusion to preserve the fine-grained details that are crucial for small object detection in UAV imagery. Extensive experiments on the VisDrone and CODrone benchmarks demonstrate that EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of \textbf{1.6}\% in AP and \textbf{5.8}\% in AP$_{s}$ on VisDrone while running at \textbf{188} FPS on a single RTX 4090 GPU.