Natural language processing (NLP) has made significant progress for well-resourced languages such as English but has lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We describe how we collected, curated, and prepared diverse monolingual texts to build a high-quality corpus for training PuoBERTa. Building on previous efforts to create monolingual resources for Setswana, we evaluate PuoBERTa on several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduce a new Setswana news categorisation dataset and provide initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and paves the way for future research directions.