Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street-view generation, but they struggle to produce place-aware, background-consistent urban scenes from text, BEV maps, and object bounding boxes, which limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller combines a linear projection, a Perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented-training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models for scene-level, place-aware synthesis and offer a practical approach to improving place recognition in autonomous driving.
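To make the place-ID controller concrete, the sketch below illustrates one plausible reading of the abstract: a learned place-ID embedding is linearly projected, resampled by a small Perceiver-style cross-attention module into tokens living in the CLIP conditioning space, and aligned to CLIP image features of the corresponding views with an InfoNCE-style contrastive loss. All module names, dimensions, token counts, and the exact loss are illustrative assumptions, not the authors' implementation.

```python
# Hedged conceptual sketch of the place-ID controller described in the abstract.
# Names, dimensions, and the training objective are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlaceIDController(nn.Module):
    """Maps a learned place-ID embedding into a fixed CLIP embedding space.

    Assumed pipeline: linear projection -> Perceiver-style cross-attention
    resampler -> tokens used like CLIP text tokens by the diffusion model.
    """

    def __init__(self, num_places: int, place_dim: int = 512,
                 clip_dim: int = 768, num_tokens: int = 8, num_layers: int = 2):
        super().__init__()
        self.place_table = nn.Embedding(num_places, place_dim)  # one embedding per place ID
        self.proj_in = nn.Linear(place_dim, clip_dim)            # linear projection
        self.latents = nn.Parameter(torch.randn(num_tokens, clip_dim) * 0.02)
        # Perceiver-style blocks: learned latent tokens attend to the projected place feature.
        self.blocks = nn.ModuleList([
            nn.MultiheadAttention(clip_dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(clip_dim)

    def forward(self, place_ids: torch.Tensor) -> torch.Tensor:
        b = place_ids.shape[0]
        ctx = self.proj_in(self.place_table(place_ids)).unsqueeze(1)  # (B, 1, clip_dim)
        tokens = self.latents.unsqueeze(0).expand(b, -1, -1)          # (B, T, clip_dim)
        for attn in self.blocks:
            out, _ = attn(tokens, ctx, ctx)                           # cross-attention
            tokens = self.norm(tokens + out)
        return tokens  # place-ID tokens in the CLIP conditioning space


def contrastive_alignment_loss(place_tokens: torch.Tensor,
                               clip_image_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling place-ID tokens toward CLIP features of the
    matching views, so the same place ID yields a consistent background."""
    p = F.normalize(place_tokens.mean(dim=1), dim=-1)  # pool tokens per sample
    v = F.normalize(clip_image_feats, dim=-1)
    logits = p @ v.t() / temperature
    labels = torch.arange(p.shape[0], device=p.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    controller = PlaceIDController(num_places=1000)
    ids = torch.randint(0, 1000, (4,))
    tokens = controller(ids)                             # (4, 8, 768)
    dummy_clip_feats = torch.randn(4, 768)               # stand-in for CLIP image features
    loss = contrastive_alignment_loss(tokens, dummy_clip_feats)
    print(tokens.shape, loss.item())
```

Under this reading, the controller's output tokens would be injected wherever the diffusion model consumes CLIP text embeddings, so background identity is tied to the place ID while text, BEV maps, and box conditions remain free to vary foreground content and weather.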