Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion

The main challenge for small object detection algorithms is maintaining accuracy while meeting real-time constraints. The RT-DETR model performs well in real-time object detection but is less accurate on small objects. To compensate for this shortcoming, this study proposes two key improvements. First, the Transformer in RT-DETR receives input solely from the final layer of backbone features. Its input therefore carries only the semantic information from the highest level of abstraction in the deep network and ignores the detailed information at lower levels of abstraction, such as edges, texture, and color gradients, that is critical for localizing small objects. Relying only on deep features can also introduce additional background noise, which further reduces small object detection accuracy. To address this issue, we propose a fine-grained path augmentation method that supplies detailed information to the deep network, so that the input to the Transformer contains both semantic and detailed information and small objects can be located more accurately. Second, the RT-DETR decoder takes feature maps from different levels as input after concatenating them with equal weight. This operation does not effectively handle the complex relationships among the multi-scale information captured by feature maps of different sizes. We therefore propose an adaptive feature fusion algorithm that assigns learnable parameters to the feature map from each level, allowing the model to fuse the levels adaptively and integrate feature information from different scales. This enhances the model's ability to capture object features at different scales and thereby improves the accuracy of small object detection.
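The adaptive fusion idea can be illustrated with a minimal sketch. The abstract does not specify how the learnable parameters are parameterized, so everything here is an assumption made for illustration: one scalar logit per feature level, a softmax over those logits rescaled by the number of levels, and the hypothetical names `softmax` and `adaptive_fuse`. Each level is represented as a list of flattened tokens. In training the logits would be updated by gradient descent; here they are plain numbers so that only the forward computation is shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(levels, logits):
    """Scale each feature level by its own weight, then concatenate tokens.

    levels: list of feature levels; each level is a list of tokens,
            and each token is a list of floats (hypothetical layout).
    logits: one scalar per level (assumed to be learnable in training).
    Returns the fused token list and the per-level weights.
    """
    if len(levels) != len(logits):
        raise ValueError("need exactly one logit per feature level")
    n = len(levels)
    # Assumption: softmax weights rescaled by n, so that equal logits
    # give weight 1 per level, i.e. plain equal-weight concatenation.
    weights = [n * w for w in softmax(logits)]
    fused = []
    for level, w in zip(levels, weights):
        fused.extend([[w * v for v in token] for token in level])
    return fused, weights


# Two levels: a coarse one with 1 token and a finer one with 2 tokens.
levels = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
fused_equal, w_equal = adaptive_fuse(levels, [0.0, 0.0])
fused_skewed, w_skewed = adaptive_fuse(levels, [math.log(3.0), 0.0])
```

With equal logits the sketch reduces to the equal-weight concatenation described for the original decoder input. With unequal logits, one level contributes more strongly to the fused sequence than the other.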
