Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English-Efik translation, leveraging a small-scale, community-curated parallel corpus of 13,865 sentence pairs. We fine-tuned two multilingual models, mT5 and NLLB-200, on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English-Efik and 31.21 for Efik-English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
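The abstract reports both BLEU and chrF; for morphologically rich, low-resource languages like Efik, chrF is often preferred because it scores character n-gram overlap rather than whole words. Below is a minimal pure-Python sketch of the chrF idea (single reference, whitespace ignored, no word n-grams, default beta=2). It is an illustration only, not the implementation the authors used; published chrF scores are typically computed with the sacrebleu library.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (as standard chrF does)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over character n-gram orders 1..max_n,
    averaged across orders, scaled to 0-100. Single reference only."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # no n-grams of this order in one side; skip
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

# Illustrative placeholder strings, not real Efik output:
print(chrf("the model translates well", "the model translates well"))  # 100.0
```

A full evaluation would use sacrebleu's `corpus_chrf` (and `corpus_bleu`) over the whole test set, which also handles multiple references and tokenization details this sketch omits.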