A major consideration in multilingual language modeling is how best to represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they are biased towards the high-resource languages of the Global West. As a result, texts in underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address these disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than those of characters, which previous methods use. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and narrows the perplexity gap across diverse languages.
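The encoding disparity described above can be made concrete with a minimal sketch. It assumes that UTF-8 is a representative example of the "contemporary text encoding methods" mentioned; the abstract does not name a specific encoding. The function `utf8_stats` and the sample greetings are hypothetical and chosen only for illustration. The sketch does not implement MYTE; it only measures how many bytes UTF-8 needs per code point in different scripts.

```python
from typing import Tuple


def utf8_stats(text: str) -> Tuple[int, int, float]:
    """Return (code points, UTF-8 bytes, bytes per code point) for `text`.

    Hypothetical helper for illustration; not part of MYTE.
    """
    if not text:
        raise ValueError("text must be non-empty")
    n_chars = len(text)                   # number of Unicode code points
    n_bytes = len(text.encode("utf-8"))   # length of the UTF-8 byte sequence
    return n_chars, n_bytes, n_bytes / n_chars


# Illustrative greetings in several scripts (assumed samples, not from the paper).
samples = {
    "English (Latin)": "hello",
    "Greek": "γεια",
    "Hindi (Devanagari)": "नमस्ते",
    "Chinese (Han)": "你好",
}

for name, text in samples.items():
    n_chars, n_bytes, ratio = utf8_stats(text)
    print(f"{name:20s} chars={n_chars:2d} bytes={n_bytes:2d} bytes/char={ratio:.1f}")
```

Under UTF-8, basic Latin letters take one byte each, Greek letters two, and Devanagari and common Han characters three, so a byte-level model sees sequences of different lengths for text of comparable content depending on the script.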