Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it remains unclear whether they truly listen to the audio or, as recent benchmarks suggest, rely largely on textual reasoning. This paper investigates the question by quantifying how much each modality contributes to a model's output. To do so, we adapt MM-SHAP, a performance-agnostic framework based on Shapley values that measures the relative contribution of each input modality to a model's prediction. Evaluating two models on the MuChoMusic benchmark, we find that the model with higher accuracy relies more heavily on text to answer questions. Closer inspection, however, shows that even when the overall audio contribution is low, models can successfully localize key sound events, suggesting that the audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs, and we hope it serves as a foundational step for future research in explainable AI and audio.
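To make the MM-SHAP score concrete, the sketch below shows one way a per-sample modality share can be computed from token-level Shapley values, following the normalization used by MM-SHAP (Parcalabescu and Frank, 2023): each modality's share is the sum of the absolute Shapley values of its tokens, divided by the total across all modalities. This is a minimal illustration under the assumption that per-token attributions are already available; the function name `mm_shap` and the toy attribution values are hypothetical, not the paper's implementation.

```python
import numpy as np

def mm_shap(phi_text: np.ndarray, phi_audio: np.ndarray) -> tuple[float, float]:
    """Per-sample modality contribution shares in the spirit of MM-SHAP.

    phi_text  -- Shapley values attributed to the text tokens
    phi_audio -- Shapley values attributed to the audio segments
    Returns (t_shap, a_shap); the two shares sum to 1.
    """
    t = np.abs(phi_text).sum()   # total attribution mass on text
    a = np.abs(phi_audio).sum()  # total attribution mass on audio
    total = t + a
    return t / total, a / total

# Hypothetical example: 5 text tokens, 4 audio segments.
t_shap, a_shap = mm_shap(
    np.array([0.40, -0.10, 0.20, 0.05, -0.30]),
    np.array([0.05, -0.02, 0.10, 0.03]),
)
print(f"T-SHAP = {t_shap:.2f}, A-SHAP = {a_shap:.2f}")  # a text-dominant sample
```

Because the score is normalized by the total attribution mass rather than by accuracy, it remains meaningful regardless of whether the model answers correctly, which is what makes the measure performance-agnostic.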