We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords as prior information in text prompts. We adopt a decoder-only architecture, using as the decoder our in-house LLM, PLaMo-100B, pre-trained from scratch on datasets dominated by Japanese and English text. A pre-trained Whisper encoder serves as the audio encoder; its audio embeddings are projected into the text embedding space by an adapter layer and concatenated with text embeddings derived from the text prompts to form the decoder input. By providing keywords as prior information in the text prompts, we can contextualize our LLM-based ASR system, without modifying the model architecture, to accurately transcribe ambiguous words in the input audio. Experimental results demonstrate that providing keywords to the decoder significantly improves the recognition of rare and ambiguous words.
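The adapter-and-concatenation step described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the dimensions, the single linear adapter layer, and all names (`AudioAdapter`, `AUDIO_DIM`, `TEXT_DIM`) are assumptions chosen for clarity.

```python
# Illustrative sketch (not the paper's actual code): audio embeddings from a
# Whisper-style encoder are projected by a linear adapter into the decoder's
# text-embedding space, then concatenated with the embedded keyword prompt.
import torch
import torch.nn as nn

AUDIO_DIM, TEXT_DIM = 1280, 4096  # hypothetical encoder / decoder widths


class AudioAdapter(nn.Module):
    """Projects audio-encoder outputs into the LLM's text embedding space."""

    def __init__(self, audio_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_emb)


adapter = AudioAdapter(AUDIO_DIM, TEXT_DIM)

audio_emb = torch.randn(1, 150, AUDIO_DIM)  # (batch, audio frames, audio dim)
prompt_emb = torch.randn(1, 12, TEXT_DIM)   # embedded keyword-prompt tokens

# Decoder input sequence: [prompt tokens ; projected audio tokens].
decoder_input = torch.cat([prompt_emb, adapter(audio_emb)], dim=1)
print(decoder_input.shape)  # torch.Size([1, 162, 4096])
```

In this setup only the prompt text changes when new keywords are supplied, which is why no architectural modification is needed for contextualization.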