We present GlotScript, an open resource and tool for low-resource writing system identification. GlotScript-R is a resource that lists the attested writing systems for more than 7,000 languages, compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts: given an input text, it returns the text's script distribution, with scripts identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript supports cleaning multilingual corpora such as mC4 and OSCAR. Second, we use GlotScript to analyze the tokenization of several language models, including GPT-4, and provide insights into each model's coverage of low-resource scripts and languages. We hope that GlotScript will become a useful resource for work on low-resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
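To make the notion of a script distribution concrete, the following is a minimal sketch of the idea, not the GlotScript-T implementation or its API. The names `SCRIPT_RANGES`, `script_of`, `script_distribution`, and `looks_clean`, as well as the `threshold` parameter, are hypothetical and introduced only for illustration. The range table is a heavily truncated, block-based approximation of five scripts, whereas a real tool would rely on the full Unicode script data.

```python
from collections import Counter

# Hypothetical, heavily truncated table of (first, last, ISO 15924 code).
# These are simplified code point ranges for illustration only; they do not
# reproduce the Unicode script property for all 161 scripts.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0400, 0x04FF, "Cyrl"),  # Cyrillic block
    (0x3041, 0x3096, "Hira"),  # Hiragana letters
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs
]


def script_of(ch):
    """Return the ISO 15924 code for one character, or None if not in the table."""
    cp = ord(ch)
    for first, last, code in SCRIPT_RANGES:
        if first <= cp <= last:
            return code
    return None


def script_distribution(text):
    """Return the fraction of recognized characters that belong to each script."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def looks_clean(text, expected_scripts, threshold=0.9):
    """Assumed cleaning rule: keep a line if enough of it is in an expected script."""
    dist = script_distribution(text)
    share = sum(p for code, p in dist.items() if code in expected_scripts)
    return share >= threshold
```

Under these assumptions, a corpus line labeled as Russian could be checked with `looks_clean(line, {"Cyrl"})`, where the set of expected scripts would come from a resource such as GlotScript-R. The 0.9 threshold is an arbitrary illustrative value, not one taken from the paper.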