Automatic Speech Recognition (ASR) is increasingly useful in the modern world. Many ASR models are available for languages with large amounts of training data, such as English. However, low-resource languages are poorly represented. In response, we create and release an openly licensed, formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits, then train and analyze two competitive ASR models to serve as baselines for future research using this data.