Unmanned aerial vehicle object detection (UAV-OD) is widely used across many scenarios. However, most existing UAV-OD algorithms rely on manually designed components that require extensive tuning. End-to-end models that avoid such components are designed mainly for natural images and are less effective on UAV imagery. To address these challenges, this paper proposes UAV-DETR, an efficient detection transformer (DETR) framework tailored for UAV imagery. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module retains critical spatial details during down-sampling, and a semantic alignment and calibration module aligns and fuses features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1\% and $\text{AP}_{50}$ by 4.2\% over the baseline. Similar improvements are observed on the UAVVaste dataset. Project page: https://github.com/ValiantDiligent/UAV-DETR