Large Audio-Language Models (LALMs) excel at audio understanding but reveal little about which parts of an audio signal they attend to. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of the maximal steering-induced attention change recovers the location of a queried sound event without any training. This readout attains 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, respectively, well above direct prompting (31.84% and 46.75%) and a random baseline (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
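The two steps described above, building a steering vector from contrasting instructions and reading out the position of the largest attention change, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the mean-difference construction, the fixed-width window around the peak, and the overlap definition (intersection divided by ground-truth duration) are all assumptions made for illustration, and the arrays stand in for activations and attention weights that would in practice be extracted from a model.

```python
import numpy as np

def steering_vector(acts_target, acts_neutral):
    """Assumed construction: mean activation difference between two
    instruction sets, each of shape (n_prompts, d_model), same audio."""
    return acts_target.mean(axis=0) - acts_neutral.mean(axis=0)

def localize(attn_base, attn_steered, frame_sec, window_sec):
    """Return a (start, end) interval in seconds centered on the audio
    frame whose attention increased most under steering.
    The fixed window width is a hypothetical readout choice."""
    delta = attn_steered - attn_base          # per-frame attention change
    peak = int(np.argmax(delta))              # frame with the largest increase
    center = (peak + 0.5) * frame_sec         # midpoint of that frame
    total = len(delta) * frame_sec            # clip duration
    start = max(0.0, center - window_sec / 2)
    end = min(total, center + window_sec / 2)
    return start, end

def overlap_ratio(pred, truth):
    """Assumed metric: intersection length over ground-truth length."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    return inter / (truth[1] - truth[0])

# Toy usage with synthetic attention over ten 1-second frames.
base = np.full(10, 0.1)
steered = base.copy()
steered[6] = 0.5                              # steering boosts frame 6
pred = localize(base, steered, frame_sec=1.0, window_sec=2.0)
score = overlap_ratio(pred, truth=(5.0, 8.0))
```

In a real setting, the steering vector would be added to a hidden layer during the forward pass and the two attention profiles would come from the unsteered and steered runs; which layer and which attention heads to use are choices the abstract does not specify.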