Swiss German is a low-resource language comprising diverse dialects that differ significantly from Standard German and from one another, and it lacks a standardized written form. As a result, transcribing Swiss German involves translating it into Standard German. Existing datasets were collected in controlled environments; the speech-to-text (STT) models trained on them are effective on such data but struggle with spontaneous conversational speech. This paper therefore introduces SRB-300, a 300-hour annotated speech corpus of real-world long-audio recordings from 39 Swiss German radio and TV stations. It captures spontaneous speech across all major Swiss German dialects, recorded in a variety of realistic environments, and overcomes the limitations of prior sentence-level corpora. We fine-tuned multiple OpenAI Whisper models on SRB-300 and obtained notable improvements over zero-shot performance: word error rate (WER) improved by 19% to 33%, and BLEU scores increased by 8% to 40%. The best fine-tuned model, large-v3, achieved a WER of 17.1% and a BLEU score of 74.8. This advance is crucial for developing effective and robust STT systems for Swiss German and other low-resource languages in real-world contexts.
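For readers unfamiliar with the WER metric reported above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are illustrative assumptions; the text normalization and scoring tool actually used for the reported numbers are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: plain whitespace tokenization with no case folding or
    punctuation stripping; real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical Standard German reference and model output:
# one deleted word ("ein") and one substitution ("test" -> "text").
wer = word_error_rate("das ist ein kleiner test", "das ist kleiner text")
print(f"WER = {wer:.1%}")  # 2 errors / 5 reference words
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions, which is worth keeping in mind when comparing scores on long, spontaneous recordings.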