Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

Multispectral object detection, which uses RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires effective feature extraction from both modalities and robust fusion strategies, and it must also cope with spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to separate the performance gains that come from the detection systems themselves from those that come from the accompanying "optimization techniques". Worse still, despite the rapid emergence of high-performing single-modality detection models, specialized training techniques that can effectively adapt these models to multispectral detection tasks are still lacking. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate these training "techniques"; it systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. We conduct a comprehensive evaluation across multiple representative multispectral object detection datasets, using various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly convert high-performing single-modality models into dual-modality models while integrating our advanced training techniques.
