Spoken Question Answering (Spoken QA) poses a challenging cross-modal problem: aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. Because attention in pre-trained models is diffusely distributed, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and outperforms large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
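To make the idea of grounding evidence through cross-modal attention concrete, the following is a minimal sketch, not the AEG implementation. It assumes an attention matrix has already been extracted from the model, with one row per spoken-query position and one column per context token, and that the context is split into token ranges. The function names `segment_scores` and `top_k_segments`, the mean aggregation over query positions, and the length normalisation are all illustrative assumptions rather than details given in the abstract.

```python
from typing import List, Sequence, Tuple


def segment_scores(
    attention: Sequence[Sequence[float]],
    segments: Sequence[Tuple[int, int]],
) -> List[float]:
    """Score each context segment by the attention it receives.

    attention[i][j] is the weight from query (speech) position i to
    context (text) token j. segments holds half-open (start, end)
    token ranges. Both layouts are assumptions made for this sketch.
    """
    n_query = len(attention)
    scores = []
    for start, end in segments:
        # Attention mass landing in the segment, averaged over query positions.
        mass = sum(sum(row[start:end]) for row in attention) / n_query
        # Divide by segment length so long segments are not favoured.
        scores.append(mass / (end - start))
    return scores


def top_k_segments(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring segments, in document order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy example: two query positions, six context tokens, three segments.
attention = [
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
segments = [(0, 2), (2, 4), (4, 6)]
scores = segment_scores(attention, segments)
evidence = top_k_segments(scores, k=1)
```

In this toy case the middle segment receives most of the attention mass and is selected as evidence. With diffuse attention the scores would be nearly uniform and the selection unreliable, which is the situation LFE fine-tuning is described as addressing.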