In this case study, we trained and published a state-of-the-art open-source model for German Automatic Speech Recognition (ASR) to evaluate the current potential of this technology for use in the larger context of Digital Humanities and cultural heritage indexation. Along with this paper, we publish our wav2vec2-based speech-to-text model and evaluate its performance on a corpus of historical recordings that we assembled, comparing it against commercial cloud-based and proprietary services. While our model achieves moderate results, the proprietary cloud services fare significantly better. As our results show, recognition rates of over 90 percent can currently be achieved; however, these numbers drop quickly once the recordings feature limited audio quality or non-everyday or outdated language. A major issue is the wide variety of dialects and accents in the German language. Nevertheless, this paper highlights that the currently available recognition quality is high enough to address various use cases in the Digital Humanities. We argue that ASR will become a key technology for the documentation and analysis of audio-visual sources, and we identify an array of important questions that the DH community and cultural heritage stakeholders will have to address in the near future.