This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
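As a rough illustration of what it means for normalization to subsume and go beyond the standard Unicode normalization forms, the sketch below first applies NFC and then applies one additional rewrite that NFC leaves untouched. It is not the library's API: it uses plain string substitution from the Python standard library in place of FST composition, and the names `EXTRA_RULES` and `normalize` as well as the single rule shown are assumptions made for this example.

```python
import unicodedata

# Hypothetical extra rewrite rules (an illustrative assumption, not the
# library's actual rule tables). Each maps a sequence that NFC leaves alone
# to a single code point that renders the same way.
EXTRA_RULES = {
    # ARABIC LETTER FARSI YEH + ARABIC HAMZA ABOVE
    #   -> ARABIC LETTER YEH WITH HAMZA ABOVE
    "\u06CC\u0654": "\u0626",
}


def normalize(text: str) -> str:
    """Apply NFC, then the extra rewrites that NFC does not cover."""
    text = unicodedata.normalize("NFC", text)
    for source, target in EXTRA_RULES.items():
        text = text.replace(source, target)
    return text


# Standard NFC already composes ALEF + MADDAH ABOVE into ALEF WITH MADDA ABOVE.
alef_madda = normalize("\u0627\u0653")

# NFC has no composition for FARSI YEH + HAMZA ABOVE, so only the extra rule
# changes this sequence.
yeh_hamza = normalize("\u06CC\u0654")
```

In an FST setting, each entry of such a table would correspond to a rewrite rule compiled into a transducer and composed with the input string; the dictionary here only stands in for that machinery.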