Text classification is a core task in Natural Language Processing (NLP), where the goal is to assign text to predefined classes. In this study, we analyse the dataset creation steps and evaluation techniques for the multi-label news categorisation task, a form of text classification. We first present a newly collected dataset for Uzbek text classification, gathered from 10 different news and press websites and covering 15 categories of news, press and law texts. We then present a comprehensive evaluation of different models on this dataset, ranging from traditional bag-of-words models to deep learning architectures. Our experiments show that models based on Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) outperform the rule-based models. The best performance is achieved by BERTbek, a transformer-based BERT model trained on an Uzbek corpus. Our findings provide a baseline for further research in Uzbek text classification.
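To make the "traditional bag-of-words" end of the model range concrete, the following is a minimal sketch of such a baseline: word counts fed to a multinomial Naive Bayes classifier with add-one smoothing. The abstract does not say which bag-of-words classifier was evaluated, so the choice of Naive Bayes, the class name `BagOfWordsNB`, the whitespace tokeniser, and the English toy sentences are all assumptions made for illustration. They are not taken from the dataset or from the authors' code. For simplicity the sketch assigns a single category to each text.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenisation; a real pipeline would likely need
    # Uzbek-aware preprocessing (assumption: not described in the abstract).
    return text.lower().split()


class BagOfWordsNB:
    """Hypothetical baseline: multinomial Naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each known word (add-one smoothing).
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy examples, not taken from the dataset.
train_texts = [
    "the team won the football match",
    "the striker scored two goals",
    "parliament passed the new tax law",
    "the court ruled on the new law",
]
train_labels = ["sport", "sport", "law", "law"]

model = BagOfWordsNB().fit(train_texts, train_labels)
prediction = model.predict("the striker won the match")
```

A baseline of this kind ignores word order entirely, which is one reason sequence-aware models such as RNNs, CNNs and BERT-style transformers are natural points of comparison.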