In spoken language understanding (SLU), many natural language understanding (NLU) methods have been adapted by giving large language models (LLMs) transcribed speech instead of conventional written text. In real-world scenarios, an automatic speech recognition (ASR) system first produces a transcript hypothesis that is then passed to the LLM, and the recognition errors it contains can degrade downstream SLU tasks. Here we introduce a method that uses the ASR system's lattice output rather than only the top hypothesis, aiming to capture speech ambiguities and improve SLU performance. Our in-context learning experiments, covering spoken question answering and intent classification, show that word confusion networks derived from lattices make the LLM more resilient to noisy speech transcripts, narrowing the SLU performance gap between the top ASR hypothesis and an oracle upper bound. We also examine the LLM's robustness across varying ASR performance conditions and analyze which aspects of in-context learning are most influential.
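To make the idea of prompting with a word confusion network concrete, the following is a minimal sketch of how such a network might be flattened into a single text string for an LLM prompt. The abstract does not specify a serialization, so everything here is an assumption made for illustration: the bin representation, the `<eps>` skip token, the `|` separator, the posterior threshold, and the helper names `linearize_wcn` and `one_best` are all hypothetical.

```python
from typing import List, Tuple

# Assumed representation: a word confusion network is a sequence of bins,
# each bin holding alternative words with their posterior probabilities.
Bin = List[Tuple[str, float]]

EPS = "<eps>"  # hypothetical marker for "no word spoken in this bin"


def linearize_wcn(wcn: List[Bin], min_posterior: float = 0.2, sep: str = "|") -> str:
    """Flatten a WCN into one string, keeping competing words side by side."""
    tokens = []
    for bin_ in wcn:
        if not bin_:
            continue  # nothing to emit for an empty bin
        # Keep alternatives at or above the threshold, most probable first.
        kept = sorted(
            (alt for alt in bin_ if alt[1] >= min_posterior),
            key=lambda alt: -alt[1],
        )
        if not kept:
            # Fall back to the single best alternative in the bin.
            kept = [max(bin_, key=lambda alt: alt[1])]
        words = [word for word, _ in kept if word != EPS]
        if words:
            tokens.append(sep.join(words))
    return " ".join(tokens)


def one_best(wcn: List[Bin]) -> str:
    """Top hypothesis only: the most probable word of every bin."""
    best = (max(bin_, key=lambda alt: alt[1])[0] for bin_ in wcn if bin_)
    return " ".join(word for word in best if word != EPS)


# Invented example data, not taken from any experiment.
example_wcn: List[Bin] = [
    [("what", 0.9), ("watt", 0.1)],
    [("is", 0.95), (EPS, 0.05)],
    [("the", 1.0)],
    [("capital", 0.6), ("capitol", 0.4)],
    [("of", 0.7), (EPS, 0.3)],
    [("france", 0.55), ("friends", 0.45)],
]

print(one_best(example_wcn))       # what is the capital of france
print(linearize_wcn(example_wcn))  # what is the capital|capitol of france|friends
```

In this sketch the threshold controls how much ambiguity reaches the prompt: a higher value approaches the top hypothesis, while a lower value exposes more alternatives at the cost of a longer and noisier input.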