The performance of audio latent diffusion models is governed mainly by two factors: the expressivity of the generator and the modelability of the underlying latent space. While recent research has concentrated on the former, and on improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that uses randomized power augmentation and a latent consistency objective to decouple signal power from power-invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves their final performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR reaches baseline performance about $2\times$ faster, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels allows classifier-free guidance (CFG) to be applied exclusively to the power-invariant content, which extends the stable guidance regime to higher guidance scales.
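The three ingredients named in the abstract (randomized power augmentation, a latent consistency objective, and CFG restricted to power-invariant channels) can be illustrated with a toy sketch. This is not the authors' implementation. Every name below is hypothetical, and so are the gain range, the convention that the first `n_power` latent channels carry power, the mean-squared form of the consistency term, and the choice to leave power channels at the conditional prediction. `toy_encoder` is a hand-built stand-in that is disentangled by construction; it is not the Stable Audio VAE.

```python
import math
import random


def apply_random_gain(waveform, rng, min_db=-12.0, max_db=12.0):
    """Scale a waveform by a gain drawn uniformly in dB (range is an assumption)."""
    gain_db = rng.uniform(min_db, max_db)
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in waveform], gain_db


def toy_encoder(waveform):
    """Hypothetical encoder: channel 0 holds log-RMS power, the rest hold
    the RMS-normalized samples (power-invariant 'content')."""
    rms = math.sqrt(sum(s * s for s in waveform) / len(waveform))
    return [math.log(rms)] + [s / rms for s in waveform]


def consistency_loss(latent_a, latent_b, n_power):
    """Mean squared difference over content channels only (assumed form)."""
    content_a, content_b = latent_a[n_power:], latent_b[n_power:]
    return sum((a - b) ** 2 for a, b in zip(content_a, content_b)) / len(content_a)


def content_only_cfg(cond, uncond, n_power, scale):
    """Standard CFG, uncond + scale * (cond - uncond), applied to content
    channels; power channels keep the conditional prediction (assumption)."""
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    return cond[:n_power] + guided[n_power:]


# Usage: the same clip at two power levels should agree on content channels.
rng = random.Random(0)
clip = [0.1, -0.3, 0.25, 0.05]
louder, gain_db = apply_random_gain(clip, rng)
z_ref, z_aug = toy_encoder(clip), toy_encoder(louder)
loss = consistency_loss(z_ref, z_aug, n_power=1)
print("consistency loss:", loss)
print("power channel shift:", z_aug[0] - z_ref[0])
```

With a learned encoder the consistency term would be nonzero and would be minimized during training; here it is numerically zero only because the toy encoder normalizes power away explicitly, while the power channel shifts by the log of the applied gain.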