The scarcity of text corpora for low-resource languages like Albanian is a serious hurdle for natural language processing research. This paper introduces AlbNews, a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian. The data can be freely used for topic modeling research. We report initial classification scores of several traditional machine learning classifiers trained on the AlbNews samples. The results, which can serve as a baseline for future experiments, show that basic models outperform ensemble learning ones.