Large Audio-Language Models and Multi-Modal Large Language Models have demonstrated strong capabilities in tasks such as Audio Question Answering (AQA), Audio Captioning, and Automatic Speech Recognition (ASR). However, there is growing evidence that these models can hallucinate audio content, describing sounds or speech that are not present in the input. To address this issue, we probe the models' internal states and propose Adaptive Vector Steering (AVS), a method that better grounds generation in audio content. We also identify a strong correlation between output correctness and internal representations. Experiments show consistent performance gains across two models and two benchmarks. On the Audio Hallucination QA dataset, our method boosts the F1-score of Gemma from 0.550 to 0.619 and that of Qwen from 0.626 to 0.632. Furthermore, our method increases the accuracy of Qwen on MMAU from 0.548 to 0.592, an 8% relative improvement. To the best of our knowledge, this is the first work to apply vector steering to mitigate hallucination in audio.
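The abstract does not spell out how AVS computes or applies its steering direction, so the following is only a minimal numpy sketch of the general activation-steering idea it builds on: derive a direction from the difference between mean activations on grounded versus hallucinated examples, then add an adaptively scaled copy of that direction to a hidden state at generation time. The function names, the projection-based scale heuristic, and all data here are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def steering_vector(grounded_acts, hallucinated_acts):
    """Direction pointing from hallucinated-example activations
    toward audio-grounded ones (mean-difference heuristic)."""
    return grounded_acts.mean(axis=0) - hallucinated_acts.mean(axis=0)

def adaptive_steer(hidden, v, base_scale=1.0):
    """Add the steering direction, scaled down when the hidden state
    already points along it (an assumed 'adaptive' rule, not AVS itself)."""
    v_norm = np.linalg.norm(v)
    proj = float(hidden @ (v / v_norm))       # alignment with the direction
    alpha = base_scale * max(0.0, 1.0 - proj / v_norm)
    return hidden + alpha * v

# Toy activations standing in for a model's hidden states.
rng = np.random.default_rng(0)
grounded = rng.normal(1.0, 0.1, size=(8, 4))      # states on correct answers
hallucinated = rng.normal(-1.0, 0.1, size=(8, 4))  # states on hallucinations
v = steering_vector(grounded, hallucinated)

h = rng.normal(size=4)          # a hidden state during generation
h_steered = adaptive_steer(h, v)
```

Steering never pushes the state away from the grounded direction: since the scale `alpha` is non-negative, the steered state's dot product with `v` is at least that of the original state.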