PoDAR: Power-Disentangled Audio Representation for Generative Modeling

The performance of audio latent diffusion models is governed mainly by two factors: the expressivity of the generator and the modelability of the underlying latent space. Recent research has concentrated on the former and on improving the reconstruction fidelity of audio codecs; we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that uses randomized power augmentation and a latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves their final performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR reaches baseline performance about $2\times$ faster, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels allows classifier-free guidance (CFG) to be applied exclusively to the power-invariant content, which extends the stable guidance regime to higher scales.
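The abstract does not give implementation details, so the following is only a minimal NumPy sketch of the three ideas it names: a random gain augmentation, a consistency penalty restricted to content channels, and CFG applied to content channels only. Everything here is an assumption made for illustration: the function names, the gain range in decibels, the convention that the first `n_power` latent channels carry power, the choice to keep the conditional prediction on power channels, and the hand-written `toy_encoder` (PoDAR itself trains a VAE rather than normalizing frames by hand).

```python
import numpy as np


def random_gain(wave, rng, low_db=-12.0, high_db=12.0):
    """Scale a waveform by a gain drawn uniformly in decibels (assumed range)."""
    gain_db = rng.uniform(low_db, high_db)
    return wave * 10.0 ** (gain_db / 20.0), gain_db


def toy_encoder(wave, frame=4):
    """Hypothetical stand-in for a learned encoder.

    Returns latents of shape (1 + frame, n_frames): channel 0 holds the
    log-RMS of each frame (power), the remaining channels hold the
    RMS-normalized samples (power-invariant content).
    """
    frames = wave.reshape(-1, frame).T  # (frame, n_frames)
    rms = np.sqrt(np.mean(frames ** 2, axis=0, keepdims=True)) + 1e-8
    return np.concatenate([np.log(rms), frames / rms], axis=0)


def consistency_loss(z_a, z_b, n_power):
    """Mean squared difference over content channels only.

    The first n_power channels are treated as power channels and are
    excluded, so they remain free to change with the gain.
    """
    diff = z_a[n_power:] - z_b[n_power:]
    return float(np.mean(diff ** 2))


def content_only_cfg(cond, uncond, scale, n_power):
    """Apply classifier-free guidance to content channels only.

    Power channels keep the conditional prediction (an assumption);
    content channels use the usual uncond + scale * (cond - uncond).
    """
    guided = uncond + scale * (cond - uncond)
    out = cond.copy()
    out[n_power:] = guided[n_power:]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wave = rng.standard_normal(64)
    louder_or_quieter, gain_db = random_gain(wave, rng)

    z_clean = toy_encoder(wave)
    z_aug = toy_encoder(louder_or_quieter)
    print("gain (dB):", gain_db)
    print("content consistency loss:", consistency_loss(z_clean, z_aug, n_power=1))
```

In this toy setting the content channels are invariant to gain by construction, so the consistency loss is close to zero while the power channel shifts by the log of the applied gain. A learned encoder would instead be pushed toward this behavior by minimizing such a penalty during training.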
