Feature selection for text classification is a fundamental task in text mining and information retrieval. Although Bangla is the sixth most widely spoken language in the world, it has received little research attention because text datasets are scarce. In this work, we collected, annotated, and prepared a dataset of 212,184 Bangla documents in seven categories and made it publicly available. We implemented three deep generative models to extract text features: an LSTM variational autoencoder (LSTM VAE), an auxiliary classifier generative adversarial network (AC-GAN), and an adversarial autoencoder (AAE). These models were originally applied in computer vision. We trained all three models on our dataset and used the feature spaces they produced for document classification. Evaluating the resulting classifiers, we found that the adversarial autoencoder produced the best feature space.