Preserving and revitalizing endangered and extinct languages conserves cultural heritage and enriches fields such as linguistics and anthropology. However, these languages are typically low-resource, which makes their reconstruction labor-intensive and costly. Nushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society, exemplifies this challenge. To address it, we introduce NushuRescue, an AI-driven framework for training large language models (LLMs) on endangered languages with minimal data. NushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nushu-Chinese parallel corpus and the first publicly available dataset of its kind. Using GPT-4-Turbo, which had no prior exposure to Nushu, and only 35 short examples from NCGold, NushuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. We also developed FastText-based and Seq2Seq models to further support research on Nushu. NushuRescue provides a versatile and scalable tool for revitalizing endangered languages while minimizing the need for extensive human input.
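The few-shot setup summarized above (35 in-context examples drawn from a parallel corpus, 50 withheld sentences for evaluation) could be sketched roughly as follows. This is an illustrative sketch only, not the NushuRescue implementation: the function names `split_corpus`, `build_prompt`, and `char_accuracy`, the prompt wording, and the position-wise character-match metric are all assumptions, since the abstract does not define how translation accuracy is computed or which direction the prompt translates. The call to the language model is deliberately left out.

```python
import random


def split_corpus(pairs, n_shots=35, n_test=50, seed=0):
    """Shuffle (source, target) pairs and carve out few-shot examples
    and a disjoint held-out test set. Sizes mirror the abstract."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[:n_shots], shuffled[n_shots:n_shots + n_test]


def build_prompt(shots, source_sentence):
    """Assemble a few-shot translation prompt (hypothetical wording)."""
    blocks = ["Translate each source sentence into the target language."]
    for source, target in shots:
        blocks.append(f"Source: {source}\nTarget: {target}")
    # The final block is left open for the model to complete.
    blocks.append(f"Source: {source_sentence}\nTarget:")
    return "\n\n".join(blocks)


def char_accuracy(prediction, reference):
    """Assumed metric: fraction of reference positions whose character
    is matched by the prediction at the same position."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / len(reference)
```

In use, one would load the parallel sentences into a list of pairs, call `split_corpus` once, send `build_prompt(shots, source)` to the model for each withheld sentence, and average `char_accuracy` over the model outputs. A stricter or more lenient metric (exact match, edit distance) could be swapped in without changing the rest of the sketch.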