Natural language processing (NLP) has made significant progress for well-resourced languages such as English but has lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We describe how we collected, curated, and prepared diverse monolingual texts to build a high-quality corpus for training PuoBERTa. Building upon previous efforts to create monolingual resources for Setswana, we evaluate PuoBERTa on several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduce a new Setswana news categorisation dataset and provide initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.