Automatic speech recognition (ASR) systems have advanced considerably with the integration of multilingual and multitask models such as Whisper, which show a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in capturing the linguistic distinctions of minority languages. This study addresses that gap by integrating traditional and novel language models with fine-tuned Whisper models to improve their performance on less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial reductions in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data on which Whisper was pre-trained, but also complements its linguistic adaptability by incorporating language models. We obtain improvements of up to 51\% on in-distribution datasets and up to 34\% on out-of-distribution sentences using statistical language models, while large language models provide moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language-model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results with transformer-based ASR models. In summary, this research paves the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. The technical documentation and source code for this study are available at http://www.github.com/hitz-zentroa/whisper-lm.
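To make the integration concrete, a minimal sketch follows, assuming the language model is combined with Whisper through shallow fusion during beam-search decoding; the weights $\alpha$ and $\beta$ below are illustrative hyperparameters rather than values prescribed by this study:
\[
\mathrm{score}(y \mid x) \;=\; \log P_{\mathrm{Whisper}}(y \mid x) \;+\; \alpha \,\log P_{\mathrm{LM}}(y) \;+\; \beta\,|y|,
\]
where $x$ is the input audio, $y$ a candidate transcription, $P_{\mathrm{LM}}$ the external language model (a statistical $n$-gram model or a large language model), and the length term $\beta\,|y|$ offsets the language model's bias toward shorter hypotheses.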