The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness depends largely on pre-training resources. This is especially evident in low-resource languages such as Sinhala, which face two primary challenges: a lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to address the challenges of adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest news corpus for Sinhala available to date.