Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
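To make the steering step concrete, the core idea behind Contrastive Activation Addition can be sketched as follows: a steering vector is taken as the difference between mean activations collected with and without a concept, and a scaled copy is added to an activation at generation time. This is a minimal illustration on toy vectors, not the implementation used in this work. The function names, the `alpha` scale, and the three-dimensional activations are hypothetical, and the sketch does not model where inside an attention layer the vector would be applied.

```python
from statistics import fmean


def steering_vector(positive, negative):
    """Per-dimension mean of concept-present activations minus concept-absent ones."""
    pos_mean = [fmean(column) for column in zip(*positive)]
    neg_mean = [fmean(column) for column in zip(*negative)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(activation, vector, alpha=1.0):
    """Add the steering vector, scaled by alpha, to a single activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy activations (hypothetical): rows are samples, columns are hidden dimensions.
with_concept = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
without_concept = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

v = steering_vector(with_concept, without_concept)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
```

In this toy case only the first dimension differs between the two sets, so only that dimension is shifted; a negative `alpha` would push the activation away from the concept instead.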