This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method for generating seamless and coherent long spectra and panoramas through latent swap joint diffusion across multiple views. We first investigate the spectrum aliasing problem that existing joint diffusion methods cause in spectrum-based audio generation. Through a comparative analysis of the VAE latent representations of Mel-spectra and RGB images, we find that the failure arises because the averaging operator excessively suppresses high-frequency components during spectrum denoising. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging the stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoids spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining the swap timing and intervals, we achieve a balance between cross-view similarity and diversity in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods, and even training-based methods, in audio generation with both U-Net and DiT models, and that it adapts effectively to longer generation lengths. It also adapts well to panorama generation, achieving comparable performance while running $2\sim20\times$ faster and generalizing better across models. More generation demos are available at https://swapforward.github.io/
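The contrast between the averaging operator and the two swap operators can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the latent layout `(channels, frequency, frames)`, the alternating-frame swap mask, the `stride` parameter, and all function names are assumptions made for illustration only.

```python
import numpy as np


def average_overlap(a, b):
    """Averaging operator over the overlap of two views.

    Both views receive the element-wise mean of their overlapping latents.
    """
    merged = 0.5 * (a + b)
    return merged, merged.copy()


def self_loop_swap(a, b, step=0):
    """Hypothetical frame-level bidirectional swap on the overlap.

    Assumption: frames alternate between the two views, and the pattern
    shifts by one frame per denoising step. Frames are exchanged, never
    blended, so each output frame is an unmodified frame of `a` or `b`.
    """
    frames = a.shape[-1]
    keep = (np.arange(frames) + step) % 2 == 0  # assumed alternating mask
    new_a = np.where(keep, a, b)
    new_b = np.where(keep, b, a)
    return new_a, new_b


def reference_guided_swap(view, reference, stride=2):
    """Hypothetical unidirectional swap from a reference view.

    Assumption: every `stride`-th frame of the reference is copied into
    the subview; the reference itself is left unchanged.
    """
    out = view.copy()
    out[..., ::stride] = reference[..., ::stride]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two independent noisy overlap latents: (channels, frequency, frames).
    a = rng.standard_normal((4, 8, 256))
    b = rng.standard_normal((4, 8, 256))

    avg_a, _ = average_overlap(a, b)
    swp_a, _ = self_loop_swap(a, b, step=0)

    # Averaging independent noise roughly halves its variance, while
    # swapping whole frames leaves the per-element variance unchanged.
    print("variance after averaging:", np.var(avg_a))
    print("variance after swapping: ", np.var(swp_a))
```

In this toy setting the averaged overlap has about half the variance of its inputs, which is one simple way to see how averaging can attenuate fine detail, whereas exchanging whole frames keeps the original statistics. How the actual method chooses the swap mask, the swap timing, and the intervals is not specified in the abstract.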