Robust semantic segmentation of road scenes under adverse illumination and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. To address this, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and the other, a hierarchical decoder, enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state of the art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.