Audio-LLM introduces the audio modality into a large language model (LLM), enabling a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed hallucination and repetition issues in the audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription-prompt-based audio-LLM that introduces an ASR expert as a transcription tokenizer, together with a hybrid autoregressive (AR) and non-autoregressive (NAR) decoding approach, to address these problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach achieves relative CER reductions of 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation sets to zero, showing that the decoding repetition problem is fundamentally solved.
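To make the reported gains concrete: the CER reductions are *relative*, not absolute percentage-point drops. The sketch below illustrates the arithmetic; the baseline CER of 10.0% is a hypothetical value chosen only for illustration and is not taken from the paper.

```python
def apply_relative_reduction(baseline_cer: float, relative_reduction: float) -> float:
    """Return the CER after a relative (not absolute) reduction.

    E.g. a 12.2% relative reduction on a 10.0% baseline CER yields
    10.0 * (1 - 0.122) = 8.78%, not 10.0 - 12.2.
    """
    return baseline_cer * (1.0 - relative_reduction)

# Hypothetical 10.0% baseline CER, for illustration only:
test_net_cer = apply_relative_reduction(10.0, 0.122)      # Test_Net: 12.2% relative
test_meeting_cer = apply_relative_reduction(10.0, 0.096)  # Test_Meeting: 9.6% relative
print(f"Test_Net: {test_net_cer:.2f}%, Test_Meeting: {test_meeting_cer:.2f}%")
```

This distinction matters when comparing systems: a relative reduction scales with the baseline, so the same 12.2% relative gain corresponds to a smaller absolute improvement on an already-strong baseline.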