Accurately depicting real-world landscapes in remote sensing (RS) images requires precise alignment between objects and their environment. However, most existing synthesis methods for natural images prioritize foreground control and often reduce the background to plain textures. This neglects the interaction between foreground and background, which can lead to incoherence in RS scenarios. In this paper, we introduce CC-Diff, a diffusion model-based approach to RS image generation with enhanced Context Coherence. To capture spatial interdependence, we propose a sequential pipeline in which background generation is conditioned on the synthesized foreground instances. Distinct learnable queries are further employed to model both the complex background texture and its semantic relation to the foreground. Extensive experiments demonstrate that CC-Diff outperforms state-of-the-art methods in visual fidelity, semantic accuracy, and positional precision in both the RS and natural image domains. CC-Diff also shows strong trainability, improving detection accuracy by 2.04 mAP on DOTA and by 2.25 mAP on COCO.
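The two design points named above, foreground-first generation and learnable context queries, can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not CC-Diff's actual implementation: the class and function names (`ContextQueries`, `sequential_sample`, `fg_denoiser`, `bg_denoiser`), the query counts, and all tensor dimensions are hypothetical.

```python
# Minimal sketch of sequential, context-coherent generation (assumed design,
# not the paper's code): foreground is denoised first, then learnable queries
# attend to the foreground features to condition background generation.
import torch
import torch.nn as nn

class ContextQueries(nn.Module):
    """Two sets of learnable queries: one models background texture, the
    other models the semantic relation between background and foreground."""
    def __init__(self, fg_dim=4, dim=256, n_texture=16, n_relation=16, heads=8):
        super().__init__()
        self.proj = nn.Linear(fg_dim, dim)            # lift fg features to query dim
        self.texture_q = nn.Parameter(torch.randn(n_texture, dim) * 0.02)
        self.relation_q = nn.Parameter(torch.randn(n_relation, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fg_feats):                      # fg_feats: (B, N, fg_dim)
        kv = self.proj(fg_feats)
        q = torch.cat([self.texture_q, self.relation_q], dim=0)
        q = q.unsqueeze(0).expand(kv.size(0), -1, -1)  # (B, n_t + n_r, dim)
        ctx, _ = self.attn(q, kv, kv)                  # queries read the foreground
        return ctx                                     # conditioning for background

@torch.no_grad()
def sequential_sample(fg_denoiser, bg_denoiser, queries, layout, steps=50):
    """Stage 1: synthesize foreground instances from the layout.
    Stage 2: generate the background conditioned on the foreground,
    so the two stay spatially and semantically coherent."""
    fg_latent = torch.randn(1, 4, 64, 64)
    for t in reversed(range(steps)):
        fg_latent = fg_denoiser(fg_latent, t, layout)   # foreground first
    fg_feats = fg_latent.flatten(2).transpose(1, 2)     # (1, HW, 4)
    ctx = queries(fg_feats)                             # context from foreground
    bg_latent = torch.randn(1, 4, 64, 64)
    for t in reversed(range(steps)):
        bg_latent = bg_denoiser(bg_latent, t, ctx)      # bg conditioned on fg
    return fg_latent, bg_latent

# Dummy denoisers stand in for the real diffusion U-Nets:
fg_denoiser = lambda x, t, cond: 0.99 * x
bg_denoiser = lambda x, t, cond: 0.99 * x
fg, bg = sequential_sample(fg_denoiser, bg_denoiser, ContextQueries(), layout=None)
```

The key property this sketch preserves is the one-way dependence: the background denoiser never runs without foreground-derived context, which is what enforces coherence between the two.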