With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and reduce manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), which comprises a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Building on a DiT-based generative foundation model, we systematically evaluate different control schemes and show that text-image-mask joint attention, combined with full fine-tuning of the image and mask branches, significantly improves the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene settings. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions under the guidance of a semantic loss during the early high-plasticity stage of sampling, mitigating the instability of generated images and narrowing the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
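The idea of rectifying sampling directions with a task loss during the early steps can be illustrated with a small toy sketch. This is not the authors' CRFM implementation: the constant velocity field stands in for the learned MM-DiT, the quadratic loss stands in for a segmentation loss, and the names `sample_with_rectification`, `rectify_until`, and `scale` are all hypothetical. It only shows the general pattern of Euler flow-matching sampling in which the velocity is corrected by a loss gradient while `t` is below a threshold.

```python
import numpy as np


def sample_with_rectification(x0, velocity_fn, loss_grad_fn,
                              num_steps=10, rectify_until=0.4, scale=1.0):
    """Euler sampling from t=0 (noise) to t=1 (sample).

    During the early steps (t < rectify_until), the velocity is shifted
    against the gradient of a task loss evaluated on the one-step
    estimate of the final sample. All names here are illustrative.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        if t < rectify_until:
            # One-step estimate of the final sample under a straight path.
            x1_hat = x + (1.0 - t) * v
            # Rectify the sampling direction using the task-loss gradient.
            v = v - scale * loss_grad_fn(x1_hat)
        x = x + dt * v
    return x


# Toy setup (assumptions, not from the paper):
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # initial noise
mu = np.ones(4)               # where the unguided flow ends up
target = np.zeros(4)          # what the stand-in "task" prefers


def velocity_fn(x, t):
    # Stand-in for a learned velocity: a straight path from x0 to mu.
    return mu - x0


def loss_grad_fn(x1_hat):
    # Gradient of the stand-in loss 0.5 * ||x1_hat - target||^2.
    return x1_hat - target


plain = sample_with_rectification(x0, velocity_fn, loss_grad_fn, scale=0.0)
rectified = sample_with_rectification(x0, velocity_fn, loss_grad_fn, scale=1.0)

print("distance to target, plain:    ", np.linalg.norm(plain - target))
print("distance to target, rectified:", np.linalg.norm(rectified - target))
```

In this toy setting, only the first few steps are corrected and later steps follow the unmodified velocity, mirroring the abstract's point that guidance is applied in the early, more malleable part of sampling. In a real pipeline the loss gradient would presumably come from backpropagating a segmentation loss through the sample estimate, which is not shown here.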