Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation

Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation. Early efforts focus on boosting the performance of only one task, \emph{e.g.,} fusion or segmentation, making it hard to reach the `best of both worlds'. To overcome this issue, we propose a \textbf{M}ulti-\textbf{i}nteractive \textbf{F}eature learning architecture for image fusion and \textbf{Seg}mentation, namely SegMiF, which exploits dual-task correlation to improve the performance of both tasks. SegMiF has a cascade structure, consisting of a fusion sub-network and a commonly used segmentation sub-network. By bridging intermediate features between the two components, the knowledge learned from the segmentation task can effectively assist the fusion task, and the improved fusion network in turn helps the segmentation network perform better. In addition, a hierarchical interactive attention block is established to ensure fine-grained mapping of all the vital information between the two tasks, so that the modality and semantic features can fully interact with each other. A dynamic weight factor is also introduced to automatically adjust the weight of each task, which balances the interactive feature correspondence and removes the need for laborious manual tuning. Furthermore, we construct a smart multi-wave binocular imaging system and collect a full-time multi-modality benchmark with 15 annotated pixel-level categories for image fusion and segmentation. Extensive experiments on several public datasets and our benchmark demonstrate that the proposed method produces visually appealing fused images and achieves, on average, $7.66\%$ higher segmentation mIoU in real-world scenes than state-of-the-art approaches. The source code and benchmark are available at \url{https://github.com/JinyuanLiu-CV/SegMiF}.
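
The segmentation result above is reported as mIoU (mean intersection over union). As a reminder of what that metric measures, the following is a minimal pure-Python sketch, not the authors' evaluation code. The function name `mean_iou`, the flat label lists, and the choice to skip classes that appear in neither the prediction nor the ground truth are all assumptions made for illustration; benchmark toolkits usually accumulate a confusion matrix over the whole test set instead.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in `pred` or `target`.

    `pred` and `target` are flat sequences of integer class labels
    (one per pixel), each in the range [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes   # pixels where both agree on class c
    pred_count = [0] * num_classes     # pixels predicted as class c
    target_count = [0] * num_classes   # pixels labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 3 classes and 6 "pixels" (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. For the benchmark described here, `num_classes` would be 15.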
