Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, most existing studies rely on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between the RGB and thermal modalities. Vision foundation models (VFMs), trained through self-supervision on vast amounts of unlabeled data, have recently proven able to extract informative, general-purpose features, yet this potential has not been fully leveraged in this domain. In this study, we take a step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we examine the distinct characteristics of the RGB and thermal modalities and design a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, outperforms all other state-of-the-art RGB-thermal scene parsing networks, ranking first on three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm opens up new opportunities for future developments in data-fusion scene parsing approaches.