This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in Kichwa, an indigenous language of Ecuador. Kichwa is an extremely low-resource, endangered language, and before Killkan no resources existed that would allow it to be incorporated into natural language processing applications. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the Universal Dependencies format. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset, with a special focus on the agglutinative morphology of Kichwa and its frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite the dataset's small size. The dataset, the ASR model, and the code used to develop them will be made publicly available. Our study thus provides a positive example of resource building and its applications for low-resource languages and their communities.