Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.
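The exact EAR formula is not reproduced here. As an illustrative sketch only (not the authors' definition), a non-compensatory aggregation can be realized by capping the overall score at the weaker of the two components, so that strong answer accuracy cannot offset a failure to initiate repair, and vice versa. The function name and the use of `min` below are assumptions for illustration:

```python
def ear_score(acc_answerable: float, repair_unanswerable: float) -> float:
    """Hypothetical non-compensatory combination of the two evaluation axes.

    acc_answerable     -- task accuracy on semantically answerable inputs (0..1)
    repair_unanswerable -- rate of appropriate repair behavior on
                           unanswerable inputs (0..1)

    Taking the minimum means a low score on either axis bounds the overall
    score: neither competence nor repair can compensate for the other.
    """
    return min(acc_answerable, repair_unanswerable)


# A model with high accuracy but almost no repair behavior scores poorly,
# reflecting the reliability gap the abstract describes.
print(ear_score(0.92, 0.10))  # -> 0.1
print(ear_score(0.85, 0.80))  # -> 0.8
```

An averaging scheme would instead let high accuracy mask missing repair behavior, which is exactly what a non-compensatory metric is meant to prevent.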