Preserving and revitalizing endangered and extinct languages conserves cultural heritage and enriches fields such as linguistics and anthropology. However, these languages are typically low-resource, which makes their reconstruction labor-intensive and costly. Nushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society, exemplifies this challenge. To address it, we introduce NushuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nushu-Chinese parallel corpus and the first publicly available dataset of its kind. Using GPT-4-Turbo, which had no prior exposure to Nushu, and only 35 short examples from NCGold, NushuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. Samples of both NCGold and NCSilver are included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nushu. NushuRescue provides a versatile and scalable tool for revitalizing endangered languages while minimizing the need for extensive human input.
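The few-shot setup described in the abstract (a small set of parallel examples given to a general-purpose LLM, then scoring on withheld sentences) might be sketched roughly as follows. This is only an illustration under stated assumptions: the prompt wording, the function names `build_few_shot_prompt`, `char_accuracy`, and `mean_accuracy`, and the position-wise character accuracy metric are all hypothetical and are not taken from NushuRescue, whose actual prompt and accuracy definition are not given here. The model call itself is omitted.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], source: str) -> str:
    """Assemble a translation prompt from (Nushu, Chinese) example pairs.

    The instruction text and layout are assumptions made for illustration.
    """
    parts = ["Translate the Nushu sentence into Chinese."]
    for nushu, chinese in examples:
        parts.append(f"Nushu: {nushu}\nChinese: {chinese}")
    # The final block leaves the Chinese side empty for the model to complete.
    parts.append(f"Nushu: {source}\nChinese:")
    return "\n\n".join(parts)


def char_accuracy(prediction: str, reference: str) -> float:
    """Position-wise character match rate (an assumed metric, not the paper's)."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    # Dividing by the longer string penalizes both missing and extra characters.
    return matches / max(len(prediction), len(reference))


def mean_accuracy(predictions: List[str], references: List[str]) -> float:
    """Average the per-sentence scores over a withheld evaluation set."""
    scores = [char_accuracy(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

In use, the example pairs would come from the training portion of the parallel corpus, each withheld source sentence would be passed through `build_few_shot_prompt`, and the model outputs would be compared against the reference translations with `mean_accuracy`.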