Panoramic imagery is increasingly used in world generation, games, and simulation, where users may need stylized, non-photorealistic environments as well as photorealistic scenes. Large-scale text-to-image diffusion and flow models provide broad style and semantic priors for this goal, but because they are trained on planar images, they are misaligned with the wrap-around topology and polar regions of $360^\circ$ panoramas represented in equirectangular projection (ERP). We present SHERPA, a lightweight adaptation framework that combines frequency-selective Circular RoPE, Circular Latent Encoding/Decoding, image-side FFN adapters, and a Dual-Path Training Scheme. Circular RoPE replaces only the seam-sensitive high-frequency horizontal RoPE band with integer-periodic harmonics while preserving the pretrained lower-frequency spectrum. The Paired Panorama Path supervises geometry, while the Unpaired Style Path uses self-supervised yaw consistency for target-free stylized prompts. As a result, SHERPA generates $360^\circ$ panoramas for both photorealistic panorama domains and open-domain stylized prompts.
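To make the idea behind frequency-selective Circular RoPE concrete, the following is a minimal sketch, not the SHERPA implementation. It assumes the standard RoPE frequency schedule $\theta_i = b^{-2i/d}$ along the horizontal axis and assumes that each frequency in the high band is snapped to the nearest integer harmonic $2\pi k / W$, so that its phase at column $W$ matches its phase at column $0$. The function names, the `n_high` band size, and the nearest-harmonic rule are all hypothetical choices made for illustration.

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    """Standard RoPE angular frequencies for one axis, highest frequency first."""
    return base ** (-np.arange(0, dim, 2) / dim)

def circular_rope_freqs(dim, width, n_high, base=10000.0):
    """Replace the n_high highest frequencies with integer-periodic harmonics.

    Assumption (illustrative): each high frequency is snapped to the nearest
    harmonic 2*pi*k/width with integer k >= 1. Lower frequencies are untouched.
    """
    freqs = rope_freqs(dim, base)
    out = freqs.copy()
    # Number of full cycles each high frequency completes across the panorama width.
    k = np.maximum(1, np.round(freqs[:n_high] * width / (2 * np.pi)))
    out[:n_high] = 2 * np.pi * k / width
    return out

def seam_gap(freqs, width):
    """Per-frequency phase mismatch between column `width` and column 0."""
    return np.abs(np.exp(1j * freqs * width) - 1.0)

if __name__ == "__main__":
    dim, width, n_high = 64, 128, 8  # hypothetical sizes
    print("original high band gap:", seam_gap(rope_freqs(dim), width)[:n_high])
    print("circular high band gap:", seam_gap(circular_rope_freqs(dim, width, n_high), width)[:n_high])
```

In this sketch the replaced band completes a whole number of cycles across the panorama width, so its rotary phase is continuous across the left/right seam, while the remaining frequencies keep their pretrained values. How SHERPA selects the band and the harmonics is not specified in the abstract.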