In this paper, we address the constraints faced when applying large language models (LLMs) to automatic speech recognition (ASR). Recent works use prefixLM-type models, which feed speech directly to the LLM as a prefix for ASR. We find that optimizing the speech prefix leads to better ASR performance, and we propose applying a recurrent neural network transducer (RNNT) loss to perform speech prefix-tuning. This simple approach neither increases model complexity nor alters the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a real-time test set covering 10 Indic languages demonstrates that the proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Averaged over the 10 Indic languages, prefix-tuning with RNNT loss gives a 12\% relative improvement in WER over the baseline with a fine-tuned LLM. With a frozen LLM, our proposed approaches lead to a 31\% relative improvement over basic soft-prompting prefixLM.
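As a reading aid for the relative WER figures quoted above, the following minimal sketch shows how a relative improvement is conventionally computed from two absolute WER values. The function name and the WER values are hypothetical, chosen only so the arithmetic lands on 12% and 31%; they are not results reported in this work.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer with respect to baseline_wer.

    Both arguments are absolute WERs on the same scale (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs (illustrative only, not taken from the paper).
fine_tuned_gain = relative_wer_improvement(baseline_wer=20.0, new_wer=17.6)
frozen_gain = relative_wer_improvement(baseline_wer=20.0, new_wer=13.8)

print(f"fine-tuned LLM: {fine_tuned_gain:.0%} relative improvement")
print(f"frozen LLM: {frozen_gain:.0%} relative improvement")
```

A relative improvement is expressed as a fraction of the baseline error, so the same relative gain corresponds to a smaller absolute WER drop when the baseline is already low.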