While large language models (LLMs) excel at a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription or be equipped with a built-in speech modality. This work focuses on the former scenario, where the LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle the speech-intent classification task, where a high word error rate can limit the LLM's ability to understand the spoken intent. Instead of chasing high accuracy by designing complex or specialized architectures regardless of deployment costs, we ask how far we can go without substantially changing the underlying ASR system and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We first explore prompt engineering to explain the concept of n-best lists to the LLM, and then finetune Low-Rank Adapters on the downstream tasks. Our approach proves effective on a device-directed speech detection task as well as on a keyword spotting task: systems using n-best list prompts outperform those using only the 1-best ASR hypothesis, paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
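As a rough illustration of prompting with an n-best list rather than the 1-best hypothesis, the sketch below formats ranked ASR hypotheses into a single text prompt. The function name `build_nbest_prompt`, the explanatory wording, and the example question are hypothetical choices made for this illustration; the abstract does not specify the prompt actually used.

```python
from typing import Sequence


def build_nbest_prompt(hypotheses: Sequence[str], question: str) -> str:
    """Format ranked ASR hypotheses into one prompt (illustrative wording only)."""
    if not hypotheses:
        raise ValueError("need at least one ASR hypothesis")
    # Assumed explanation of the n-best concept; not the authors' prompt text.
    lines = [
        "Below is an n-best list: a speech recognizer's ranked guesses for "
        "one utterance, most likely first. Any guess may contain errors."
    ]
    # One numbered line per hypothesis, preserving the ASR ranking.
    for rank, hypothesis in enumerate(hypotheses, start=1):
        lines.append(f"{rank}. {hypothesis.strip()}")
    # The task question goes last so the model answers right after reading the list.
    lines.append(question)
    return "\n".join(lines)


example_prompt = build_nbest_prompt(
    ["play some music", "play sum music", "lay some music"],
    "Is this utterance directed at the device? Answer yes or no.",
)
print(example_prompt)
```

Passing a single-element list reduces this to a 1-best prompt, which makes it easy to compare the two settings with the same template.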