While large language models (LLMs) excel at a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription or be equipped with a built-in speech modality. This work focuses on the former scenario, where the LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle the speech-intent classification task, where a high word error rate can limit the LLM's ability to understand the spoken intent. Instead of chasing high accuracy by designing complex or specialized architectures regardless of deployment costs, we ask how far we can go without substantially changing the underlying ASR system and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We first explore prompt engineering to explain the concept of n-best lists to the LLM, and then finetune Low-Rank Adapters on the downstream tasks. Our n-best-list approach proves effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using only the 1-best ASR hypothesis, thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
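To make the idea of n-best prompting concrete, the following is a minimal sketch of how a ranked list of ASR hypotheses might be turned into a single text prompt. The function name `build_nbest_prompt`, the explanatory wording, and the example hypotheses are all hypothetical and chosen for illustration; they are not the prompt or data used in this work.

```python
from typing import List


def build_nbest_prompt(hypotheses: List[str], task_question: str) -> str:
    """Format a ranked n-best list of ASR hypotheses as one text prompt.

    The explanatory sentences are illustrative assumptions, not the
    prompt wording used by the authors.
    """
    if not hypotheses:
        raise ValueError("need at least one ASR hypothesis")
    lines = [
        "Below is an n-best list: the speech recognizer's ranked guesses "
        "for one utterance.",
        "The guesses may contain recognition errors; hypothesis 1 is the "
        "most likely.",
    ]
    # Number the hypotheses so the model can see their rank order.
    for rank, hyp in enumerate(hypotheses, start=1):
        lines.append(f"{rank}. {hyp}")
    # The task-specific question goes last, after all hypotheses.
    lines.append(task_question)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical 3-best list for a single utterance.
    nbest = [
        "turn on the light",
        "turn on the lights",
        "turn on delight",
    ]
    question = "Is this utterance directed at the device? Answer yes or no."
    print(build_nbest_prompt(nbest, question))
```

Passing a single-element list yields a 1-best prompt in the same layout, which may be convenient when comparing the two settings under otherwise identical conditions.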