Multi-sensor clues have shown promise for object segmentation, but the inherent noise of each sensor, together with calibration errors that arise in practice, may degrade segmentation accuracy. In this paper, we propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features, with the aim of controlling the modal contribution based on relative entropy. We explore semantics among the multimodal inputs in two respects: the modality-shared consistency and the modality-specific variation. Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) a coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision. On the one hand, the AF block explicitly dissociates the shared and specific representations and learns to weight the modal contribution by adjusting the proportion, region, and pattern according to the quality of each modality. On the other hand, our CFD first decodes the shared feature and then refines the output through specificity-aware querying. Further, we enforce semantic consistency across the decoding layers to enable interaction across network hierarchies, improving feature discriminability. Extensive comparisons on eleven datasets with depth or thermal clues, and on two challenging tasks, namely salient and camouflaged object segmentation, validate the effectiveness of our approach in terms of both performance and robustness.
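As a minimal illustration of the relative-entropy idea mentioned in the abstract, the sketch below computes the Kullback-Leibler divergence between two discrete distributions and turns it into a scalar gate for an auxiliary modality. It is not XMSNet's fusion rule, which the abstract does not specify: the function names `kl_divergence` and `agreement_gate`, the exponential gating form, and the toy distributions are all assumptions made here for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) between two discrete distributions.

    `eps` guards against log(0); it is a numerical convenience, not part
    of the definition.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def agreement_gate(p_rgb, p_aux):
    """Hypothetical gate in (0, 1] for the auxiliary modality.

    Assumption for illustration only: the more the auxiliary (depth or
    thermal) distribution diverges from the RGB one, the smaller its weight.
    """
    return math.exp(-kl_divergence(p_rgb, p_aux))


# Toy per-pixel foreground/background distributions (made-up numbers).
p_rgb = [0.5, 0.5]
p_aux_consistent = [0.5, 0.5]
p_aux_noisy = [0.9, 0.1]

gate_consistent = agreement_gate(p_rgb, p_aux_consistent)
gate_noisy = agreement_gate(p_rgb, p_aux_noisy)
print(gate_consistent, gate_noisy)
```

Under this assumed rule, an auxiliary modality that agrees with RGB keeps a gate of 1, while a diverging one is attenuated. Relative entropy is asymmetric, so swapping the arguments of `kl_divergence` generally changes the gate.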