Text classification has become a crucial task in many fields, and a large body of research has produced automated text classification systems for national and international languages. However, there is a growing need for such systems in local languages. This study aims to establish an automated classification system for Pashto text. To this end, we constructed a dataset of Pashto documents and compared a range of statistical and neural machine learning models to identify the most effective approach: DistilBERT-base-multilingual-cased, Multilayer Perceptron (MLP), Support Vector Machine, K-Nearest Neighbors, decision tree, Gaussian naïve Bayes, multinomial naïve Bayes, random forest, and logistic regression. We also evaluated two feature extraction methods, bag of words and Term Frequency-Inverse Document Frequency (TF-IDF). In single-label multiclass classification, the MLP classifier with TF-IDF features achieved an average test accuracy of 94%. MLP with TF-IDF likewise yielded the best F1-measure, 0.81. Furthermore, pre-trained language representation models such as DistilBERT showed promising results for Pashto text classification; however, the study highlights the importance of developing a language-specific tokenizer to achieve reasonable results.
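The two feature extraction methods compared above can be illustrated with a minimal sketch. It is not the study's implementation: the whitespace tokenizer, the smoothed IDF variant with L2 normalization, the function names `bag_of_words` and `tf_idf`, and the toy English placeholder documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents.

    Whitespace splitting is an assumption; a real Pashto pipeline would
    likely need a language-specific tokenizer.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors).

    Uses one common smoothed variant (an assumption, not the paper's stated
    formula): idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    vectors = []
    for vec in counts:
        weighted = [c * w for c, w in zip(vec, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        vectors.append([x / norm for x in weighted])
    return vocab, vectors


# Hypothetical toy corpus (placeholder tokens, not data from the study).
docs = [
    "sports team match",
    "team election vote",
    "vote election result",
]
vocab, bow_vectors = bag_of_words(docs)
_, tfidf_vectors = tf_idf(docs)
print(vocab)
print(bow_vectors[0])
print([round(x, 3) for x in tfidf_vectors[0]])
```

Bag of words gives every occurring term the same weight per count, whereas TF-IDF down-weights terms that appear in many documents. In the toy corpus, "team" occurs in two documents and therefore receives a lower weight in the first document than "match" or "sports", which occur in only one. Either set of vectors could then be passed to a classifier such as an MLP.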