Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, namely predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
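To make the five benchmarked scores concrete, the following is a minimal sketch using formulations that are common in the text-only LLM literature; the exact definitions used in this study may differ. It assumes that several responses have already been sampled for one audio-question pair, that each response has a summed token log-probability, and that semantically equivalent responses have already been grouped into clusters (for example by an entailment model). All function and argument names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def predictive_entropy(seq_logprobs):
    """Monte Carlo estimate: negative mean sequence log-likelihood."""
    return -sum(seq_logprobs) / len(seq_logprobs)


def length_normalized_entropy(seq_logprobs, lengths):
    """Same estimate, with each log-likelihood divided by its token count."""
    per_token = [lp / n for lp, n in zip(seq_logprobs, lengths)]
    return -sum(per_token) / len(per_token)


def semantic_entropy(seq_logprobs, cluster_ids):
    """Entropy over meaning clusters, weighting each cluster by the
    (renormalized) probability mass of the responses it contains.
    Assumption: cluster_ids come from an external equivalence check."""
    mass = defaultdict(float)
    for lp, c in zip(seq_logprobs, cluster_ids):
        mass[c] += math.exp(lp)
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())


def discrete_semantic_entropy(cluster_ids):
    """Entropy over meaning clusters using only sample counts,
    so no token probabilities are required."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def p_true_uncertainty(p_true):
    """Verification-based score: the model is asked whether its own answer
    is correct, and p_true is the probability it assigns to 'True'."""
    return 1.0 - p_true


# Hypothetical example: three sampled answers, two of which share a meaning.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.25)]
lengths = [1, 2, 2]
clusters = [0, 0, 1]

print(predictive_entropy(logprobs))
print(length_normalized_entropy(logprobs, lengths))
print(semantic_entropy(logprobs, clusters))
print(discrete_semantic_entropy(clusters))
print(p_true_uncertainty(0.9))
```

The first two scores operate on token-level likelihoods, while the semantic variants first collapse paraphrases into a single meaning, which is the distinction the abstract draws between token-level baselines and semantic-level methods.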