Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. A prominent feature of these languages is that each complex grapheme comprises consonants or consonant conjuncts, vowel diacritics, and consonant diacritics, which together give each language its distinctive written form. Unicode-based writing schemes for these languages often disregard this feature and encode words as linear sequences of Unicode characters, relying on an intricate scheme of connector characters and font interpreters. Because a few dozen Unicode characters are used to write thousands of distinct glyphs (complex graphemes), serious ambiguities arise that lead to malformed words. In this paper, we propose two libraries: i) a normalizer that resolves inconsistencies caused by the Unicode-based encoding scheme for Indic languages and ii) a grapheme parser for Abugida text. The parser deconstructs words into visually distinct orthographic syllables, or complex graphemes, and their constituents. Our proposed normalizer is more efficient and effective than the previously used IndicNLP normalizer. Moreover, our parser and normalizer are suitable tools for general Abugida text processing, as they performed well in our word-based and NLP experiments. We report the pipeline for the scripts of 7 languages in this work and develop a framework for integrating more scripts.
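To make the distinction between linear Unicode sequences and orthographic syllables concrete, the following is a minimal sketch in Python using only the standard library. It is not the parser or normalizer proposed in this paper: the function names `split_orthographic_syllables` and `normalize_then_split` are hypothetical, Unicode NFC is used only as a stand-in for a normalization step, and the grouping rule (attach combining marks, and join the consonant that follows a virama) is a simplifying assumption that ignores ZWJ/ZWNJ and script-specific exceptions.

```python
import unicodedata

# Assumption: a virama (hasanta/halant) is identified by canonical
# combining class 9, which Unicode assigns to virama characters.
VIRAMA_CCC = 9


def is_virama(ch: str) -> bool:
    return unicodedata.combining(ch) == VIRAMA_CCC


def is_mark(ch: str) -> bool:
    # Mn = nonspacing mark, Mc = spacing combining mark (vowel signs etc.)
    return unicodedata.category(ch) in ("Mn", "Mc")


def split_orthographic_syllables(word: str) -> list[str]:
    """Group a linear code point sequence into approximate complex graphemes.

    Hypothetical helper: a mark attaches to the current cluster, and the
    character right after a virama is joined to form a conjunct.
    """
    clusters: list[str] = []
    join_next = False
    for ch in word:
        if clusters and (is_mark(ch) or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = is_virama(ch)
    return clusters


def normalize_then_split(word: str) -> list[str]:
    # NFC stands in for a normalizer here; it only merges canonically
    # equivalent sequences and does not repair malformed words.
    return split_orthographic_syllables(unicodedata.normalize("NFC", word))


if __name__ == "__main__":
    bengali = "\u09ac\u09bf\u09b6\u09cd\u09ac"  # five code points
    print(len(bengali), normalize_then_split(bengali))
```

In this sketch the five-code-point Bengali word is grouped into two clusters, the second of which is a conjunct held together by the virama. A production parser would additionally need to handle joiner characters and invalid diacritic sequences, which is the kind of ambiguity the normalizer described above is meant to address.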