Flow-matching transformers achieve strong audio separation, yet their attention dynamics remain opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We also observe asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts throughout sampling. In addition, the model attenuates temporal segmentation cues to maintain continuous-flow stability. Building on these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
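The caching idea behind LSAC can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: the class `LayerSelectiveCache`, the `refresh_every` policy, the helper `run_sampler`, and the choice of which layer indices count as stable are all hypothetical, since the abstract only says that attention is cached in stable layers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set


@dataclass
class LayerSelectiveCache:
    """Toy cache that reuses attention outputs only for 'stable' layers.

    Assumption: a stable layer recomputes attention every `refresh_every`
    sampling steps and reuses the stored result in between. The abstract
    does not specify the refresh policy; this one is illustrative.
    """
    stable_layers: Set[int]
    refresh_every: int = 5
    computed: int = 0  # number of attention computations actually run
    reused: int = 0    # number of attention computations skipped
    _store: Dict[int, Any] = field(default_factory=dict)

    def attend(self, layer: int, step: int, compute: Callable[[], Any]) -> Any:
        can_reuse = (
            layer in self.stable_layers
            and layer in self._store
            and step % self.refresh_every != 0
        )
        if can_reuse:
            self.reused += 1
            return self._store[layer]
        out = compute()
        self.computed += 1
        if layer in self.stable_layers:
            self._store[layer] = out  # fast layers are never cached
        return out


def run_sampler(num_layers: int, num_steps: int, cache: LayerSelectiveCache) -> float:
    """Walk every (step, layer) pair and return the fraction of skipped calls."""
    for step in range(num_steps):
        for layer in range(num_layers):
            # Stand-in for a real self-attention call (hypothetical payload).
            cache.attend(layer, step, lambda l=layer, s=step: ("attn", l, s))
    return cache.reused / (num_layers * num_steps)
```

The layer count, step count, and stable set passed to `run_sampler` are free parameters of the sketch; the fraction it returns reflects only those toy choices and says nothing about the savings reported above. In a real model, the stable set would come from the layerwise convergence analysis rather than being fixed by hand.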