Remote sensing semantic segmentation must address both what the ground objects in an image are and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning the high-frequency features required for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances a model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. We therefore integrate the strengths of discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation, which is subsequently leveraged by an iterative denoising diffusion process to refine the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm that our framework consistently refines the boundaries of coarse results produced by diverse discriminative architectures.
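The pipeline described above (coarse map from a discriminative backbone, a conditioning guidance network fusing it with the image, then iterative denoising refinement) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the module names (`backbone`, `guidance_net`, `denoiser`), the stand-in computations inside them, and the simplified update rule are all assumptions; in the real framework each would be a trained neural network and a proper DDPM noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(image):
    """Stand-in discriminative backbone: produces a coarse segmentation map.
    (Hypothetical; the real framework uses a trained segmentation model.)"""
    return image.mean(axis=-1, keepdims=True)  # (H, W, 1) toy coarse map

def guidance_net(image, coarse):
    """Stand-in conditioning guidance network: jointly encodes the original
    image and the coarse map into a guidance representation."""
    return np.concatenate([image, coarse], axis=-1)  # (H, W, C+1)

def denoiser(x_t, guidance, t):
    """Stand-in noise predictor. A real model would be a trained U-Net
    conditioned on the guidance representation and timestep t."""
    return x_t - guidance[..., -1:]  # 'noise' = deviation from coarse map

def refine(image, steps=10):
    """Iterative denoising refinement of the coarse segmentation."""
    coarse = backbone(image)
    guide = guidance_net(image, coarse)
    x = rng.standard_normal(coarse.shape)  # start from pure noise
    for t in range(steps, 0, -1):
        eps = denoiser(x, guide, t)
        x = x - eps / steps  # simplified DDPM-style update toward the map
    return coarse, x
```

In this toy version each update shrinks the gap between the noisy state and the coarse map, so the refined output converges toward it; in the actual framework the learned denoiser instead reconstructs a segmentation whose boundaries are sharper than the coarse input.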