Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
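The token-only intervention described above can be illustrated with a minimal sketch. Only the six concept names and the UAR metric come from the abstract. The three-level binning, the thresholds, the `<concept:level>` token format, and the exact definitions of the shuffled, conflicting, and corrupted conditions are assumptions made for illustration, and all function names are hypothetical.

```python
import random

# Concept names follow the abstract; levels and token format are assumptions.
CONCEPTS = ["energy", "pitch", "dynamics", "brightness", "formants", "voice_quality"]
LEVELS = ["low", "mid", "high"]


def to_level(z, lo=-0.5, hi=0.5):
    """Map a z-scored feature summary to a coarse level (hypothetical thresholds)."""
    if z < lo:
        return "low"
    if z > hi:
        return "high"
    return "mid"


def aligned_tokens(z_scores):
    """Tokens that agree with the utterance's own acoustic summary."""
    return {c: to_level(z_scores[c]) for c in CONCEPTS}


def conflicting_tokens(tokens):
    """Assumed 'conflicting' condition: flip low and high, leave mid unchanged."""
    flip = {"low": "high", "high": "low", "mid": "mid"}
    return {c: flip[v] for c, v in tokens.items()}


def shuffled_tokens(tokens, rng):
    """Assumed 'shuffled' condition: permute the levels across concepts."""
    values = list(tokens.values())
    rng.shuffle(values)
    return dict(zip(tokens.keys(), values))


def corrupted_tokens(tokens, rng):
    """Assumed 'corrupted' condition: replace every level with a random one."""
    return {c: rng.choice(LEVELS) for c in tokens}


def build_prompt(instruction, tokens):
    """Append the cue tokens to the text prompt; the audio input is not touched."""
    cue = " ".join(f"<{c}:{v}>" for c, v in tokens.items())
    return f"{instruction}\nAcoustic cues: {cue}"


def uar(y_true, y_pred):
    """Unweighted average recall: the mean of the per-class recalls."""
    recalls = []
    for k in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == k]
        recalls.append(sum(y_pred[i] == k for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Because every condition changes only the text prompt, differences in UAR between the aligned and perturbed conditions can be attributed to the symbolic cue channel rather than to the audio.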