Whisper is a state-of-the-art automatic speech recognition (ASR) model (Radford et al., 2022). Although Swiss German dialects are allegedly not part of Whisper's training data, preliminary experiments showed that Whisper can transcribe Swiss German quite well, producing output that amounts to a speech translation into Standard German. To gain a better understanding of Whisper's performance on Swiss German, we systematically evaluate it using automatic, qualitative, and human evaluation. We test its performance on three existing test sets: SwissDial (Dogan-Sch\"onberger et al., 2021), STT4SG-350 (Pl\"uss et al., 2023), and Swiss Parliaments Corpus (Pl\"uss et al., 2021). In addition, we create a new test set for this work, based on short mock clinical interviews. For the automatic evaluation, we use word error rate (WER) and BLEU. In the qualitative analysis, we discuss Whisper's strengths and weaknesses and analyze some output examples. For the human evaluation, we conduct a survey with 28 participants who were asked to evaluate Whisper's performance. All of our evaluations suggest that Whisper is a viable ASR system for Swiss German, as long as Standard German output is desired.
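As a rough illustration of the WER metric mentioned above, the following sketch computes the word-level edit distance between a Standard German reference and a system hypothesis, divided by the number of reference words. It is not the authors' evaluation code: the function name `word_error_rate`, the whitespace tokenization, and the lowercasing step are assumptions made for this example, and the sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase and split on whitespace; a real evaluation may
    # also strip punctuation or apply other text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one of five reference words is missing in the hypothesis.
wer = word_error_rate("das ist ein kleiner test", "das ist ein test")
print(f"WER = {wer:.2f}")
```

Because WER penalizes every lexical difference, a valid Standard German paraphrase of a Swiss German utterance can still score poorly, which is one reason to report a translation metric such as BLEU alongside it.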