Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% AP_S (small-object AP), outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.
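The purification stage described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy version, not the paper's implementation: features at each spatial position are first normalized across channels (position-wise normalization), then reconstructed as a soft-assignment mixture over a learned codebook, so that modality-specific noise outside the codebook's span is attenuated. The codebook here is random for illustration; in CMAFNet it would be learned end-to-end.

```python
import numpy as np

def positionwise_norm(feat, eps=1e-5):
    # feat: (C, H, W); zero-mean, unit-variance across channels
    # at each spatial position (the "position-wise" normalization)
    mu = feat.mean(axis=0, keepdims=True)
    sigma = feat.std(axis=0, keepdims=True)
    return (feat - mu) / (sigma + eps)

def codebook_purify(feat, codebook):
    # feat: (C, H, W); codebook: (K, C) of K learned codewords.
    # Each position's feature is replaced by a softmax-weighted
    # reconstruction from the codewords (dictionary-based purification).
    C, H, W = feat.shape
    x = feat.reshape(C, -1).T                 # (H*W, C)
    logits = x @ codebook.T                   # (H*W, K) similarities
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)         # soft assignment weights
    recon = w @ codebook                      # (H*W, C) reconstruction
    return recon.T.reshape(C, H, W)

# Toy usage: an 8-channel 4x4 feature map, 16 random codewords.
rng = np.random.default_rng(0)
rgb_feat = rng.normal(size=(8, 4, 4))
codebook = rng.normal(size=(16, 8))
purified = codebook_purify(positionwise_norm(rgb_feat), codebook)
print(purified.shape)  # (8, 4, 4)
```

Because both the RGB and depth branches are reconstructed from shared-statistics, normalized features, their outputs become statistically compatible before fusion, which is the alignment effect the abstract attributes to this stage.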