This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate whether LLMs' in-context learning capabilities can enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating how difficult it is to leverage LLMs' in-context learning for ASR applications. Despite further exploration with varied settings and models, the sentences corrected by the LLMs frequently had higher Word Error Rates (WER) than the original transcriptions, demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and their implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions remains a challenging task at the current stage.
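To make the WER comparison concrete, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function names and the example sentences are illustrative assumptions and are not taken from the paper's experiments. Whitespace tokenization is also an assumption that suits English text such as LibriSpeech; Mandarin data such as Aishell-1 would typically need character-level tokenization instead.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Hypothetical sentences, chosen only to illustrate the effect described above.
reference = "the cat sat on the mat"
asr_output = "the cat sat on the mad"                # one substitution
llm_corrected = "the cat is sitting on the mat"      # fluent rewrite, but two edits

asr_wer = wer(reference, asr_output)
llm_wer = wer(reference, llm_corrected)
print(f"ASR WER: {asr_wer:.3f}, LLM-corrected WER: {llm_wer:.3f}")
```

In this toy case the rewritten sentence is fluent yet scores worse than the raw ASR output, because WER penalizes any deviation from the reference wording, including paraphrases.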