We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, for which no automatic speech recognition (ASR) resources were previously publicly available. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. Fine-tuning OpenAI Whisper-small yields a Word Error Rate (WER) of 26.74% and a Character Error Rate (CER) of 8.67% on a 538-utterance speaker-disjoint validation set, down from a zero-shot baseline of 159.19% WER and 152.52% CER. A Whisper-base model fine-tuned on the same data achieves 44.54% WER and 15.61% CER, indicating that model capacity matters in this low-resource setting. The dataset, the fine-tuned model, and a live transcription demo are publicly available on Hugging Face.
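For readers unfamiliar with the metrics, the following is a minimal sketch of how WER and CER are commonly computed from Levenshtein edit distance. It is not the evaluation code used for BaltiVoice. The function names, the whitespace tokenization, the choice to count spaces as characters in CER, and the Latin placeholder strings are all assumptions made for illustration. The sketch also shows why a zero-shot score can exceed 100%: when a hypothesis contains many inserted tokens, the number of edits can be larger than the number of reference tokens.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (substitutions,
    deletions, and insertions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count.
    Assumes whitespace tokenization (an illustrative choice)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference
    length. Assumes spaces count as characters (conventions vary)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder strings, not Balti data.
print(wer("a b c", "a x c"))      # one substitution over three words
print(wer("a b", "a b c d e"))    # three insertions over two words: above 1.0
print(cer("abc", "abd"))          # one substitution over three characters
```

Corpus-level scores are usually obtained by summing edits and reference lengths over all utterances before dividing, rather than averaging per-utterance rates; the sketch above scores a single utterance only.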