Robust semantic segmentation of road scenes under adverse illumination and shadow conditions remains a core challenge for autonomous driving. RGB-Thermal fusion is a standard approach, yet existing methods apply a static fusion strategy uniformly across all conditions, allowing modality-specific noise to propagate through the network. We therefore propose CLARITY, a network that dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution according to the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state of the art, achieving 62.3% mIoU and 77.5% mAcc.
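The core idea of condition-adaptive fusion can be sketched as a gating module in which a scalar illumination score scales each modality's features before fusion. The sketch below is illustrative only: the module name, the way the illumination score is supplied, and the gating layout are assumptions, not the CLARITY implementation (where the score would come from VLM priors rather than being passed in directly).

```python
import torch
import torch.nn as nn


class ConditionGatedFusion(nn.Module):
    """Minimal sketch of illumination-conditioned RGB-Thermal fusion.

    A per-image illumination score in [0, 1] (assumed given, e.g. from
    a VLM prior) produces per-channel gates that modulate how much each
    modality contributes before a 1x1 convolution fuses the features.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Map the scalar condition to one sigmoid gate per channel,
        # for each of the two modalities.
        self.gate = nn.Sequential(
            nn.Linear(1, channels * 2),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, rgb_feat, thermal_feat, illum_score):
        # rgb_feat, thermal_feat: (B, C, H, W); illum_score: (B, 1).
        g = self.gate(illum_score)             # (B, 2C)
        g_rgb, g_th = g.chunk(2, dim=1)        # (B, C) each
        g_rgb = g_rgb[:, :, None, None]        # broadcast over H, W
        g_th = g_th[:, :, None, None]
        fused = torch.cat([rgb_feat * g_rgb, thermal_feat * g_th], dim=1)
        return self.fuse(fused)                # (B, C, H, W)
```

Because the gates depend on the illumination score, a dark scene can down-weight noisy RGB channels while keeping thermal features intact, instead of applying one fixed mixing weight to every input.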