Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information present in the audio. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often struggle with unanswerable ones, pointing to a blind spot in current audio-language understanding.
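To make the three scenarios concrete, the following is a minimal sketch of how unanswerable variants could be derived from one answerable multiple-choice item. It is an illustration only, not the benchmark's actual construction pipeline: the `Item` class, the three helper functions, the `is_correct` scoring rule, and the sample clip and options are all hypothetical names and data introduced here, and representing "unanswerable" as `answer=None` is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice audio QA item (not the benchmark's schema)."""
    audio_id: str
    question: str
    options: List[str]
    answer: Optional[str]  # None marks the item as unanswerable (assumption)


def absent_answer(item: Item) -> Item:
    """Absent Answer Detection: drop the correct option from the choices."""
    remaining = [o for o in item.options if o != item.answer]
    return replace(item, options=remaining, answer=None)


def incompatible_answer_set(item: Item, mismatched: List[str]) -> Item:
    """Incompatible Answer Set Detection: swap in options of the wrong category."""
    return replace(item, options=list(mismatched), answer=None)


def incompatible_audio_question(item: Item, question: str, options: List[str]) -> Item:
    """Incompatible Audio Question Detection: ask something the audio cannot ground."""
    return replace(item, question=question, options=list(options), answer=None)


def is_correct(item: Item, prediction: Optional[str]) -> bool:
    """Score a prediction; None stands for the model declining to answer."""
    if item.answer is None:
        return prediction is None
    return prediction == item.answer


# Illustrative data only.
base = Item("clip_001", "Which animal is heard?", ["dog", "cat", "bird", "cow"], "dog")
aad = absent_answer(base)
iasd = incompatible_answer_set(base, ["red", "blue", "green", "yellow"])
iaqd = incompatible_audio_question(
    base, "What color is the speaker's shirt?", ["red", "blue", "green", "yellow"]
)
```

Under this sketch, a model that always picks one of the listed options is scored as wrong on every unanswerable variant, which matches the failure mode the abstract describes.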