We present GlotScript, an open resource and tool for low-resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns the script distribution of that text, with scripts identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we use GlotScript to analyze the tokenization of a number of language models, such as GPT-4, and provide insights into each model's coverage of low-resource scripts and languages. We hope that GlotScript will become a useful resource for work on low-resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
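To make the notion of a script distribution concrete, the following is a minimal sketch and not the GlotScript-T implementation. It assumes that the first word of a character's Unicode name is a usable proxy for its script, which is only roughly true, and it relies on a small hand-written mapping to ISO 15924 codes. The function name `script_distribution` and the table `NAME_PREFIX_TO_ISO15924` are hypothetical names introduced here for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny mapping from the first word of a Unicode
# character name to an ISO 15924 code. A real tool would use the Unicode
# Script property and cover every script, which this table does not.
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
    "HEBREW": "Hebr",
    "DEVANAGARI": "Deva",
    "CJK": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
    "HANGUL": "Hang",
}


def script_distribution(text):
    """Return {ISO 15924 code: share of alphabetic characters} for text."""
    counts = Counter()
    for ch in text:
        # Skip digits, punctuation, and whitespace, which are shared
        # across scripts and would distort the distribution.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0] if name else ""
        # "Zzzz" is the ISO 15924 code for an uncoded script; it is used
        # here for anything the small table above does not recognize.
        counts[NAME_PREFIX_TO_ISO15924.get(prefix, "Zzzz")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}
```

Under these assumptions, a text that mixes two Latin letters with two Cyrillic letters yields equal shares for `Latn` and `Cyrl`. In a corpus-cleaning setting, a distribution like this could be compared against the scripts attested for the labeled language to flag suspicious lines.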