Synthetic datasets are widely used for training urban scene recognition models, but even highly realistic renderings show a noticeable gap to real imagery. This gap is particularly pronounced when adapting to a specific target domain, such as Cityscapes, where differences in architecture, vegetation, object appearance, and camera characteristics limit downstream performance. Closing this gap with more detailed 3D modelling would require expensive asset and scene design, defeating the purpose of low-cost labelled data. To address this, we present a new framework that adapts an off-the-shelf diffusion model to a target domain using only imperfect pseudo-labels. Once trained, it generates high-fidelity, target-aligned images from semantic maps of any synthetic dataset, including low-effort sources created in hours rather than months. The method filters suboptimal generations, rectifies image-label misalignments, and standardises semantics across datasets, transforming weak synthetic data into competitive real-domain training sets. Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0 mIoU percentage points over state-of-the-art translation methods, making rapidly constructed synthetic datasets as effective as high-effort synthetic datasets requiring months of manual design. This work highlights a valuable collaborative paradigm where fast semantic prototyping, combined with generative models, enables scalable, high-quality training data creation for urban scene understanding.