Latent diffusion models have shown promising results in audio generation, making notable advances over traditional methods. However, while their performance is impressive on short audio clips, it degrades when extended to longer audio sequences. These challenges stem from the model's self-attention mechanism and from training predominantly on 10-second clips, which complicates extension to longer audio without adaptation. To address these issues, we introduce LiteFocus, a novel approach that accelerates the inference of existing audio latent diffusion models in long audio synthesis. Observing the attention patterns in self-attention, we employ a dual sparse form of attention computation, designated same-frequency focus and cross-frequency compensation: it curtails attention computation under a same-frequency constraint while preserving audio quality through cross-frequency refilling. LiteFocus reduces the inference time of a diffusion-based text-to-audio (TTA) model by 1.99x when synthesizing 80-second audio clips, while also improving audio quality.
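To make the dual sparse form concrete, the following is a minimal sketch of how such an attention mask could be constructed. It assumes a hypothetical token layout in which the latent spectrogram is flattened time-major (token index = frame * n_freq + bin); the function name, the layout, and the temporal `window` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def litefocus_mask(n_time: int, n_freq: int, window: int = 2) -> np.ndarray:
    """Sketch of a dual-sparse attention mask (assumed layout:
    tokens flattened time-major, index = t * n_freq + f)."""
    n = n_time * n_freq
    t_idx = np.arange(n) // n_freq  # time frame of each token
    f_idx = np.arange(n) % n_freq   # frequency bin of each token
    # Same-frequency focus: a token attends to tokens sharing its frequency bin.
    same_freq = f_idx[:, None] == f_idx[None, :]
    # Cross-frequency compensation (illustrative): also attend to tokens in
    # nearby time frames (within +/- `window` frames), at any frequency.
    cross_freq = np.abs(t_idx[:, None] - t_idx[None, :]) <= window
    return same_freq | cross_freq
```

For long sequences the mask density scales roughly as 1/n_freq plus a constant band, rather than the full quadratic cost of dense self-attention, which is the source of the inference speedup.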