Data availability and quality are major challenges in natural language processing for low-resourced languages. In particular, there is significantly less data available than for higher-resourced languages. This data is also often of low quality, rife with errors, invalid text, or incorrect annotations. Many prior works focus on dealing with these problems, either by generating synthetic data or by filtering out low-quality parts of datasets. We instead investigate these factors more deeply, by systematically measuring the effect of data quantity and quality on the performance of pre-trained language models in a low-resourced setting. Our results show that having fewer completely-labelled sentences is significantly better than having more sentences with missing labels, and that models can perform remarkably well with only 10% of the training data. Importantly, these results are consistent across ten low-resourced languages, English, and four pre-trained models.