Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated by a high-capacity multilingual translation model. We fine-tune mBART, a multilingual sequence-to-sequence model, on both curated-only and synthetically augmented data, and evaluate translation quality with chrF++, the primary metric in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani--Spanish and Quechua--Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.
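As a concrete illustration of the pipeline summarized above, the sketch below shows one plausible way to generate synthetic Spanish--Guarani pairs with a multilingual translation model, apply simple normalization and noise filtering, and score outputs with chrF++ via sacreBLEU. The paper does not name its generation model or preprocessing code; the NLLB-200 checkpoint, the language codes (`spa_Latn`, `grn_Latn`), the length-ratio threshold, and the helper names `generate_synthetic_pairs` and `clean_pair` are illustrative assumptions, not the authors' implementation.

```python
"""Minimal sketch of synthetic data generation and chrF++ evaluation."""
import unicodedata

from sacrebleu.metrics import CHRF
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: the paper does not name its translation model; NLLB-200
# is used here as a stand-in high-capacity multilingual generator.
MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def generate_synthetic_pairs(spanish_sentences, tgt_lang="grn_Latn"):
    """Translate monolingual Spanish into the indigenous language to
    create synthetic (source, target) training pairs."""
    tokenizer.src_lang = "spa_Latn"  # NLLB code for Spanish
    inputs = tokenizer(spanish_sentences, return_tensors="pt", padding=True)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=128,
    )
    hyps = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return list(zip(spanish_sentences, hyps))


def clean_pair(src, tgt, max_ratio=3.0):
    """Generic preprocessing sketch: NFC orthographic normalization plus
    a simple length-ratio noise filter (threshold is an assumption)."""
    src = unicodedata.normalize("NFC", src)
    tgt = unicodedata.normalize("NFC", tgt)
    ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
    return (src, tgt) if ratio <= max_ratio else None


# chrF++ is chrF extended with word n-grams up to order 2, which is
# sacreBLEU's CHRF metric with word_order=2.
chrfpp = CHRF(word_order=2)


def chrfpp_score(hypotheses, references):
    """Corpus-level chrF++ for a list of hypotheses and references."""
    return chrfpp.corpus_score(hypotheses, [references]).score
```

A filter like `clean_pair` would typically be applied to both the curated and the synthetic pairs before fine-tuning, so that generation noise does not dominate the augmented corpus.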