In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often rely on separate models for question audio transcription and answer selection, resulting in significant resource usage and error accumulation. To address these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. We introduce a novel zero-shot SQA approach and compare it with traditional cascade systems. Through a comprehensive evaluation on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a system combining a 1.3B-parameter LLM with a 1.55B-parameter ASR model, while improving average accuracy by 0.5%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.