Low-resource languages present unique challenges for natural language processing because digitized, well-structured linguistic data are scarce. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between one of these languages and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the datasets' creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real-world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.