From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis

Street-view imagery has become an essential source for geospatial data collection and urban analytics, enabling the extraction of insights that support informed decision-making. However, synthesizing street-view images from corresponding satellite imagery is challenging because the two domains differ substantially in appearance and viewing perspective. This paper presents a hybrid framework that integrates diffusion-based models and conditional generative adversarial networks (GANs) to generate geographically consistent street-view images from satellite imagery. Our approach uses a multi-stage training strategy that incorporates Stable Diffusion as the core component within a dual-branch architecture. To enhance the framework's capabilities, we integrate a conditional GAN that enables the generation of geographically consistent panoramic street views. Furthermore, we implement a fusion strategy that leverages the strengths of both models to create robust representations, thereby improving the geometric consistency and visual quality of the generated street-view images. The proposed framework is evaluated on the challenging Cross-View USA (CVUSA) dataset, a standard benchmark for cross-view image synthesis. Experimental results demonstrate that our hybrid approach outperforms diffusion-only methods across multiple evaluation metrics and achieves competitive performance compared to state-of-the-art GAN-based methods. The framework generates realistic and geometrically consistent street-view images while preserving fine-grained local details, including street markings, secondary roads, and atmospheric elements such as clouds.
