Audio-aware large language models have recently shown strong performance on audio question answering. However, existing benchmarks cover mainly answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or mismatched with the information the audio actually provides. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing from the choices), Incompatible Answer Set Detection (the answer choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant to, or insufficiently grounded in, the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of more robust and trustworthy audio-language systems. Our experiments suggest that while models excel on standard answerable tasks, they often struggle with unanswerable ones, exposing a blind spot in current audio-language understanding.