Generative audio requires fine-grained, controllable outputs, yet most existing methods either require retraining the model for specific controls or rely on inference-time controls (\textit{e.g.}, guidance) that are themselves computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to decoder backpropagation, we introduce a guidance-based approach built on selective training-free guidance (TFG) and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step while requiring minimal training resources (7M parameters and $\approx$ 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations thereof) while maintaining generation quality. Our method balances precision and audio fidelity at far lower computational cost than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.