Detecting small objects in UAV remote sensing images and identifying surface defects in industrial inspection remain difficult tasks. These applications share common obstacles: features are sparse and weak, backgrounds are cluttered, and object scales vary dramatically. Current transformer-based detectors, while powerful, struggle with three critical issues. First, features degrade severely as the network progressively downsamples. Second, spatial convolutions cannot capture long-range dependencies effectively. Third, standard upsampling methods inflate feature maps unnecessarily. We introduce DFIR-DETR to tackle these problems by combining dynamic feature aggregation with frequency-domain processing. Our architecture builds on three novel components. The DCFA module uses dynamic K-sparse attention, reducing complexity from O(N²) to O(NK), and employs spatial gated linear units for better nonlinear modeling. The DFPN module applies amplitude-normalized upsampling to prevent feature inflation and uses dual-path shuffle convolution to retain spatial details across scales. The FIRC3 module operates in the frequency domain, achieving a global receptive field without sacrificing efficiency. We evaluated our method extensively on the NEU-DET and VisDrone datasets. It achieves mAP50 scores of 92.9% and 51.6%, respectively, both state-of-the-art results. The model remains lightweight, with only 11.7M parameters and 41.2 GFLOPs. Strong performance across two very different domains confirms that DFIR-DETR generalizes well and is well suited to cross-scene small object detection in resource-limited settings.
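The DCFA and FIRC3 descriptions above rely on two general techniques: top-K sparse attention and frequency-domain filtering with a global receptive field. The following NumPy code is a minimal sketch of generic versions of both, written under stated assumptions. It is not the authors' DCFA or FIRC3 implementation. The function names, the tensor shapes, and the use of a per-channel complex filter are illustrative assumptions. For clarity, the attention scores are computed densely here. Only the softmax and value aggregation are restricted to K keys per query, which is the O(NK) part. A practical implementation would also need to avoid materializing the full N×N score matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, top_k):
    """Generic top-K sparse attention (illustrative, not the paper's DCFA).

    q, k: (N, d) queries and keys; v: (N, d_v) values. Returns (N, d_v).
    Each query attends only to its top_k highest-scoring keys.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                                   # (N, N), dense for clarity
    # Indices of the top_k largest scores per query (order within the K is arbitrary).
    idx = np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]   # (N, K)
    sel = np.take_along_axis(scores, idx, axis=-1)                  # (N, K)
    w = softmax(sel, axis=-1)                                       # weights over K keys only
    # Gather the K selected value vectors per query and mix them: O(N K d_v).
    return np.einsum("nk,nkd->nd", w, v[idx])                       # (N, d_v)

def fft_global_filter(x, weight):
    """Generic frequency-domain global filter (illustrative, not the paper's FIRC3).

    x: (C, H, W) real feature map; weight: (C, H, W // 2 + 1) complex filter
    (hypothetically a learned parameter). Multiplying in the frequency domain
    equals a circular convolution whose kernel spans the whole map, so every
    output pixel depends on every input pixel at O(C H W log(H W)) cost.
    """
    X = np.fft.rfft2(x, axes=(-2, -1))                               # (C, H, W // 2 + 1)
    return np.fft.irfft2(X * weight, s=x.shape[-2:], axes=(-2, -1))  # back to (C, H, W)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
    print(topk_sparse_attention(q, k, v, top_k=8).shape)             # (64, 16)
    x = rng.standard_normal((4, 32, 32))
    w = rng.standard_normal((4, 32, 17)) + 1j * rng.standard_normal((4, 32, 17))
    print(fft_global_filter(x, w).shape)                             # (4, 32, 32)
```

Setting top_k equal to N recovers ordinary dense softmax attention. An all-ones filter makes fft_global_filter the identity. Both cases make convenient sanity checks.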