This study examines the performance of Whisper-Small and Wav2Vec2-XLS-R-300M, two pre-trained multilingual speech-to-text models, on the Turkish language. The Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset, was used in the study. Both multilingual models were fine-tuned on this dataset, which contains a small amount of data, and their speech-to-text performance was compared. The word error rate (WER) was calculated as 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. In addition, the models were evaluated on test data prepared from call center records that were not included in the training and validation datasets.
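For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the study. The function name `word_error_rate`, the whitespace tokenization, and the absence of text normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and no normalization
    (casing, punctuation) is applied; the study may have done otherwise.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One missing word out of four reference words.
    print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))
```

Under this definition, a WER of 0.16 corresponds to roughly 16 word-level errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.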