FLEURS offers n-way parallel speech for more than 100 languages, but Northern Kurdish is not among them, which limits benchmarking of automatic speech recognition (ASR) and speech translation for this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3: kmr) spoken extension of the FLEURS benchmark. The dataset consists of 5,162 validated utterances totaling 18 hours and 24 minutes, recorded by 31 native speakers, and it extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper large-v3 for ASR and for end-to-end speech-to-text translation (E2E S2TT). A two-stage fine-tuning strategy (Common Voice first, then FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on the test set). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on the test set; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluating ASR, S2TT, and speech-to-speech translation (S2ST). The dataset is publicly released for research use under a CC BY 4.0 license.
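For readers unfamiliar with the ASR metrics quoted above, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed: the edit distance between reference and hypothesis, divided by the reference length, reported as a percentage. The helper names `edit_distance` and `error_rate`, the corpus-level aggregation, and the absence of any text normalization are assumptions for illustration; the abstract does not state which scoring tool or normalization was used, and the toy strings below are not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER).

    Assumption: edits and reference lengths are summed over all utterances
    before dividing, and no casing or punctuation normalization is applied.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return 100.0 * total_edits / total_len


# Toy example (hypothetical strings, not from FLEURS-Kobani).
refs = ["the cat sat"]
hyps = ["the cat sit"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
```

Because a single wrong character makes the whole word count as an error, WER is typically well above CER on the same output, which is consistent with the gap between the two figures reported above.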