NushuRescue: Revitalization of the Endangered Nushu Language with AI

The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address it, we introduce NushuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nushu-Chinese parallel corpus, the first publicly available dataset of its kind. Using GPT-4-Turbo, which had no prior exposure to Nushu, and only 35 short examples from NCGold, NushuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. Samples of both NCGold and NCSilver are included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nushu. NushuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.
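The few-shot setup described above (35 examples shown to the model, 50 sentences withheld for evaluation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the random split, and the Chinese-to-Nushu direction are all assumptions made for illustration, and the corpus below is placeholder data rather than NCGold.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def split_corpus(pairs: List[Pair], n_shots: int = 35, n_test: int = 50,
                 seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Randomly pick disjoint few-shot and withheld subsets (hypothetical helper)."""
    if n_shots + n_test > len(pairs):
        raise ValueError("corpus too small for the requested split")
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_shots], shuffled[n_shots:n_shots + n_test]


def build_prompt(shots: List[Pair], query: str,
                 src_label: str = "Chinese", tgt_label: str = "Nushu") -> str:
    """Assemble a plain-text few-shot prompt; the wording is an assumption."""
    blocks = [f"Translate each {src_label} sentence into {tgt_label}."]
    for src, tgt in shots:
        blocks.append(f"{src_label}: {src}\n{tgt_label}: {tgt}")
    # The final block leaves the target side empty for the model to complete.
    blocks.append(f"{src_label}: {query}\n{tgt_label}:")
    return "\n\n".join(blocks)


# Placeholder 500-pair corpus standing in for a parallel dataset.
corpus = [(f"<chinese_{i}>", f"<nushu_{i}>") for i in range(500)]
shots, held_out = split_corpus(corpus)
prompt = build_prompt(shots, held_out[0][0])
```

The resulting `prompt` string would then be sent to a language model, and the model's completions for the withheld sentences compared against their reference translations; how that comparison is scored is not specified here.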

