Spoken question answering (SQA) has made significant progress in recent years. However, many existing methods, including large audio language models, struggle to process long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise for preprocessing long-form speech, yet the performance of existing speech-related retrievers falls short. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby bridging the gap between modalities more effectively. Experimental results on four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches that combine speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
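As a rough illustration of the retrieval setting described above, the following minimal sketch ranks audio segments against a question by cosine similarity and computes a question-to-segment contrastive (InfoNCE-style) loss. It is not the CLSR implementation: the embeddings are assumed to be already computed by some encoder, the function names `top_k_segments` and `info_nce` and the temperature value are hypothetical, and the intermediate step that converts acoustic features into text-like representations is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_segments(question_emb, segment_embs, k=2):
    """Return indices of the k segments most similar to the question.

    Assumption: question and segment embeddings live in a shared space.
    """
    ranked = sorted(
        range(len(segment_embs)),
        key=lambda i: cosine(question_emb, segment_embs[i]),
        reverse=True,
    )
    return ranked[:k]


def info_nce(question_embs, segment_embs, temperature=0.07):
    """Question-to-segment InfoNCE loss over a batch.

    Assumption: the i-th question is paired with the i-th segment, and all
    other segments in the batch act as negatives.
    """
    losses = []
    for i, q in enumerate(question_embs):
        logits = [cosine(q, s) / temperature for s in segment_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy usage with hand-written 2-D embeddings (illustrative values only).
question = [1.0, 0.0]
segments = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
selected = top_k_segments(question, segments, k=2)
```

In a long-form SQA pipeline of this kind, only the selected segments would be passed to the downstream question answering model, which keeps its input short regardless of the length of the original recording.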