Infrared object detection aims to identify and locate objects in complex environments (\eg, darkness, snow, and rain) where visible-light cameras fail due to poor illumination. However, the low contrast and weak edge information of infrared images make it challenging to extract discriminative object features for robust detection. To address this issue, we propose a novel vision-language representation learning paradigm for infrared object detection, in which additional textual supervision with rich semantic information guides the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module that aligns object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles the text-aligned object features from non-object features by minimizing their correlation. Finally, the disentangled object features are fed into the detection head. In this manner, detection performance is remarkably enhanced by more discriminative and less noisy features. Extensive experiments demonstrate that our approach achieves superior performance on two benchmarks: M\textsuperscript{3}FD (83.7\% mAP) and FLIR (86.1\% mAP). Our code will be made publicly available upon acceptance.
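The abstract states that OFD disentangles object and non-object features by minimizing their correlation, but gives no formula. As a minimal illustrative sketch (not the paper's actual loss), one common way to realize such an objective is to penalize the cross-correlation matrix between the two feature sets; the function name and all shapes below are hypothetical:

```python
import numpy as np

def decorrelation_loss(f_obj: np.ndarray, f_bg: np.ndarray) -> float:
    """Hypothetical correlation-minimization penalty.

    f_obj: (N, D) object features, f_bg: (N, D) non-object features.
    Returns the mean squared entry of their D x D cross-correlation
    matrix; driving this toward zero decorrelates the two sets.
    """
    # Standardize each feature dimension over the batch.
    z_obj = (f_obj - f_obj.mean(axis=0)) / (f_obj.std(axis=0) + 1e-8)
    z_bg = (f_bg - f_bg.mean(axis=0)) / (f_bg.std(axis=0) + 1e-8)
    # Empirical cross-correlation between the two feature sets.
    c = z_obj.T @ z_bg / f_obj.shape[0]
    return float((c ** 2).mean())
```

Identical feature sets yield a large penalty (the diagonal of the cross-correlation is near one), while independent features yield a penalty near zero, so minimizing it pushes object and non-object representations apart.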