PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language with 60 million speakers that remains severely underrepresented in NLP. The corpus is assembled from 39 sources (seven HuggingFace datasets and 32 purpose-built web scrapers) and processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08 to 6.06). On WikiANN Pashto NER, the model with continued pretraining improves entity F1 by 10% relative (19.0% to 21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), and PashtoCorp covers 97.9% of the WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. The corpus, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
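
To illustrate the SHA-256 deduplication step named above, the following is a minimal sketch of exact-duplicate removal by content hash. It is not the released pipeline: the function names `normalize` and `dedup_sha256`, and the choice to apply NFC normalization and whitespace collapsing before hashing, are assumptions made for illustration only.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    # Assumed normalization (hypothetical, not taken from the paper):
    # Unicode NFC, then collapse every run of whitespace to one space.
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedup_sha256(docs: Iterable[str]) -> Iterator[str]:
    # Keep the first document seen for each distinct SHA-256 digest
    # of the normalized text; later exact duplicates are skipped.
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


if __name__ == "__main__":
    sample = ["سلام نړۍ", "سلام   نړۍ ", "پښتو"]
    # The second entry differs from the first only in whitespace,
    # so it hashes to the same digest and is dropped.
    print(list(dedup_sha256(sample)))
```

Hashing a normalized form catches only exact duplicates after normalization; near-duplicates with small textual differences would need a separate technique, which this sketch does not attempt.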
