Audio diffusion models can synthesize high-fidelity music from text, yet fine-grained control over specific musical attributes remains challenging because their internal mechanisms for representing high-level concepts are poorly understood. In this work, we use activation patching to show that recent audio diffusion architectures exhibit a semantic bottleneck: a small, shared subset of consecutive attention layers controls distinct musical concepts, such as the presence of specific instruments, vocals, or genres. Building on this finding, we systematically evaluate a broad spectrum of steering paradigms, comparing activation steering against prompt-level, score-space, and weight-space interventions and analyzing how the steering mechanism interacts with the intervention site. Our new benchmark, supported by an extensive user study, shows that localized activation steering establishes a new state of the art in audio concept modulation.
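To make the two mechanisms named above concrete, the following is a minimal sketch of activation patching followed by localized activation steering on a toy residual model. It is not the architecture or procedure evaluated in this work: the scalar "prompt embedding", the `GAINS` table, the `0.25` threshold, and all function names are assumptions invented for illustration, chosen so that two consecutive layers dominate the output.

```python
# Toy stand-in for a diffusion backbone: a residual stream that each
# "attention layer" updates with a prompt-dependent contribution.
# GAINS is hypothetical and chosen so that layers 2 and 3 dominate.
GAINS = [0.05, 0.05, 1.0, 1.0, 0.05, 0.05]


def attention_out(layer, cond):
    """Hypothetical attention output of one layer for a scalar prompt embedding."""
    return GAINS[layer] * cond


def run(x, cond, patch=None, steer=None):
    """Forward pass over all layers.

    `patch` maps layer -> activation that replaces the layer's own output.
    `steer` maps layer -> offset added to the layer's output.
    Returns (final output, list of per-layer activations).
    """
    patch = patch or {}
    steer = steer or {}
    h, cache = x, []
    for layer in range(len(GAINS)):
        a = patch.get(layer, attention_out(layer, cond))
        a += steer.get(layer, 0.0)
        cache.append(a)
        h = h + a  # residual update
    return h, cache


def patching_effects(x, cond_src, cond_tgt):
    """Share of the source/target output gap recovered by patching each layer."""
    out_src, cache_src = run(x, cond_src)
    out_tgt, _ = run(x, cond_tgt)
    gap = out_src - out_tgt
    effects = []
    for layer in range(len(GAINS)):
        out_patched, _ = run(x, cond_tgt, patch={layer: cache_src[layer]})
        effects.append((out_patched - out_tgt) / gap)
    return effects


# Activation patching: sweep layers and keep those with a large effect.
effects = patching_effects(x=0.0, cond_src=1.0, cond_tgt=-1.0)
bottleneck = [i for i, e in enumerate(effects) if e > 0.25]  # threshold is arbitrary

# Localized steering: add the source-minus-target activation difference
# only at the layers located above, leaving the prompt unchanged.
_, cache_src = run(0.0, 1.0)
_, cache_tgt = run(0.0, -1.0)
steer = {i: cache_src[i] - cache_tgt[i] for i in bottleneck}
steered, _ = run(0.0, -1.0, steer=steer)
```

In this toy setting the sweep singles out the two high-gain layers, and steering only those layers moves the target-prompt output most of the way toward the source-prompt output. In a real model the same pattern would typically be implemented with forward hooks on the attention modules rather than an explicit loop.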