Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, are mostly adapted from NLP and audio captioning; they rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address this gap in the literature, we make three contributions. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple human raters for correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting their weak correlation with human judgment, especially for longer answers. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and motivate the development of better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
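As a minimal sketch of the kind of metric benchmarking described above, the snippet below correlates an automatic metric's scores with human ratings over a set of annotated responses. The data, field layout, and choice of correlation statistics (Spearman and Pearson) are illustrative assumptions; the abstract does not specify AQEval's exact annotation format or the paper's correlation protocol.

```python
# Illustrative sketch, not the paper's protocol: measure how well an automatic
# AQA metric (e.g., BERTScore) agrees with human correctness judgments.
from scipy.stats import spearmanr, pearsonr

# Hypothetical AQEval-style records: one human rating and one candidate-metric
# score per model response (placeholder values).
human_ratings = [1.0, 0.5, 0.0, 1.0, 0.5]
metric_scores = [0.91, 0.62, 0.45, 0.88, 0.70]

rho, _ = spearmanr(human_ratings, metric_scores)  # rank correlation
r, _ = pearsonr(human_ratings, metric_scores)     # linear correlation
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```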