Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

From arXiv: 26 pages, 6 figures, includes supplementary materials. To be submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Automatic speech recognition (ASR) systems have advanced considerably with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results obtained with transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. Further implementation details, technical documentation, and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
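The abstract does not spell out how the language model scores are combined with Whisper's output. One common way to do this is shallow-fusion-style rescoring, where each candidate transcript receives a weighted sum of the ASR score, the language model score, and a length bonus. The sketch below illustrates that general idea on toy data and is not the authors' implementation: the function names, the weights `alpha` and `beta`, the toy unigram model, and all scores are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def rescore(nbest, lm_logprob, alpha, beta):
    """Pick the hypothesis maximizing asr + alpha * lm + beta * word_count.

    nbest: list of (text, asr_logprob) pairs (assumed input format).
    alpha, beta: hypothetical LM weight and length bonus.
    """
    def fused(item):
        text, asr_logprob = item
        return asr_logprob + alpha * lm_logprob(text) + beta * len(text.split())
    return max(nbest, key=fused)[0]


# Toy unigram language model with made-up log-probabilities.
TOY_LM = {"the": -1.0, "cat": -2.0, "sat": -2.5, "cut": -6.0, "sad": -5.0}


def toy_lm_logprob(text: str) -> float:
    # Unknown words get a fixed low score (an arbitrary choice for this sketch).
    return sum(TOY_LM.get(word, -10.0) for word in text.split())


# Made-up n-best list: the ASR score alone prefers a wrong transcript.
nbest = [("the cut sat", -3.0), ("the cat sat", -3.4), ("the cat sad", -3.2)]
reference = "the cat sat"

baseline = max(nbest, key=lambda item: item[1])[0]
fused_best = rescore(nbest, toy_lm_logprob, alpha=0.5, beta=0.0)

wer_base = word_error_rate(reference, baseline)
wer_fused = word_error_rate(reference, fused_best)
print(f"baseline: {baseline!r}, WER = {wer_base:.3f}")
print(f"with LM:  {fused_best!r}, WER = {wer_fused:.3f}")
```

In this toy setting the language model term outweighs the small ASR score advantage of the wrong transcript. In practice, weights such as `alpha` and `beta` would need to be tuned on held-out data, which is consistent with the abstract's remark about the importance of optimized language model parameters.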

