We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no textual or speech resources beyond those included in the benchmark, and differs considerably from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy. Only 5 hours of these recordings have matching transcriptions, and the forced alignment of that portion is of variable quality. The corpus contains an additional 20 hours of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models, with a best phone error rate of 30.4%, obtained using a pipeline that continues pre-training the foundation model on the unlabelled set.
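For readers unfamiliar with the metric, the following is a minimal sketch of how a phone error rate such as the one reported above is commonly computed: the Levenshtein (edit) distance between reference and hypothesized phone sequences, divided by the number of reference phones. The function names are hypothetical, and pooling errors over the whole test set (rather than averaging per-utterance rates) is an assumption; the abstract does not specify the benchmark's exact scoring procedure.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences.

    Counts the minimum number of substitutions, deletions, and
    insertions needed to turn `hyp` into `ref`.
    """
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (phone missing from hyp)
                    curr[j - 1] + 1,     # insertion (extra phone in hyp)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def phone_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: total edits divided by total reference phones.

    Assumption: errors are pooled over all utterances before dividing.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference set contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones
```

Under this definition a PER of 30.4% corresponds to roughly three edits for every ten reference phones. Because insertions count as errors, the rate can exceed 100% for a very poor hypothesis.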