Hawrami, a dialect of Kurdish, is classified as an endangered language because data on it are scarce and its number of speakers is gradually declining. Natural Language Processing (NLP) projects can partially compensate for the limited availability of data for endangered languages and dialects through approaches such as machine translation, language model building, and corpus development. Similarly, NLP tasks such as text classification contribute to language documentation. Several text classification studies have been conducted for Kurdish, but they have mainly addressed two dialects: Sorani (Central Kurdish) and Kurmanji (Northern Kurdish). In this paper, we introduce several text classification models built on a dataset of 6,854 Hawrami articles labeled into 15 categories by two native speakers. We use K-Nearest Neighbors (KNN), Linear Support Vector Machine (Linear SVM), Logistic Regression (LR), and Decision Tree (DT) classifiers to evaluate how well each method performs the classification task. The results indicate that Linear SVM achieves 96% accuracy and outperforms the other approaches.
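The comparison described above can be illustrated with a minimal sketch. It assumes scikit-learn and TF-IDF features, neither of which is stated in the abstract, and it uses a tiny placeholder corpus of English sentences rather than the Hawrami dataset. The hyperparameters and the `evaluate` helper are likewise hypothetical, so the printed scores say nothing about the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): toy English sentences with two labels,
# standing in for the 6,854 Hawrami articles and their 15 categories.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players",
    "fans filled the stadium for the final",
    "the parliament passed a new law",
    "the minister announced an election",
    "voters went to the polls today",
    "the president signed the treaty",
]
train_labels = ["sport"] * 4 + ["politics"] * 4
test_texts = ["the players scored a goal", "the election law was passed"]
test_labels = ["sport", "politics"]

# The four classifier families named in the abstract.
# All hyperparameters here are illustrative assumptions.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Linear SVM": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
}


def evaluate(models, train_x, train_y, test_x, test_y):
    """Fit each model on TF-IDF features and return its test accuracy."""
    scores = {}
    for name, clf in models.items():
        # A fresh vectorizer per pipeline keeps the models independent.
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        pipeline.fit(train_x, train_y)
        scores[name] = accuracy_score(test_y, pipeline.predict(test_x))
    return scores


scores = evaluate(models, train_texts, train_labels, test_texts, test_labels)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

On a real corpus, the placeholder lists would be replaced by a held-out split of the labeled articles, and accuracy would be compared across the four models in the same way.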