Previous cross-lingual transfer methods learn representations only from orthography, that is, from the written script. This limitation hampers cross-lingual transfer and biases models towards languages that share similar, widely used scripts. To narrow the gap between languages with different writing scripts, we propose PhoneXL, a framework that incorporates phonemic transcriptions as an additional linguistic modality, alongside the traditional orthographic transcriptions, for cross-lingual transfer. Specifically, we propose unsupervised alignment objectives to capture (1) local one-to-one alignment between the two modalities, (2) alignment via multi-modality contexts to leverage information from additional modalities, and (3) alignment via multilingual contexts, where additional bilingual dictionaries are incorporated. We also release the first phonemic-orthographic alignment dataset on two token-level tasks (Named Entity Recognition and Part-of-Speech Tagging) among the understudied but interconnected Chinese-Japanese-Korean-Vietnamese (CJKV) languages. Our pilot study reveals that phonemic transcription provides essential information beyond orthography, enhancing cross-lingual transfer and bridging the gap among CJKV languages, and leads to consistent improvements on cross-lingual token-level tasks over orthography-based multilingual PLMs.
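To make the first objective more concrete, the following is a minimal sketch of what a local one-to-one alignment loss between orthographic and phonemic token embeddings could look like. It is an illustration only, not the PhoneXL implementation: the abstract does not specify the loss, so the symmetric InfoNCE-style contrastive form, the function name `local_alignment_loss`, the `temperature` parameter, and the toy embeddings are all assumptions.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(scores):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def local_alignment_loss(ortho, phon, temperature=0.1):
    """Hypothetical symmetric contrastive loss over paired token embeddings.

    ortho[i] and phon[i] are assumed to embed the same token in the
    orthographic and phonemic modality, respectively.
    """
    ortho = [_normalize(v) for v in ortho]
    phon = [_normalize(v) for v in phon]
    # Cosine similarity of every orthographic token with every phonemic token.
    scores = [
        [sum(a * b for a, b in zip(o, p)) / temperature for p in phon]
        for o in ortho
    ]
    transposed = [list(col) for col in zip(*scores)]
    # Average the orthographic-to-phonemic and phonemic-to-orthographic terms.
    return 0.5 * (_diag_cross_entropy(scores) + _diag_cross_entropy(transposed))


# Toy embeddings (made up for illustration): three tokens, three dimensions.
ortho = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
phon_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
phon_shuffled = [phon_aligned[1], phon_aligned[2], phon_aligned[0]]

aligned_loss = local_alignment_loss(ortho, phon_aligned)
shuffled_loss = local_alignment_loss(ortho, phon_shuffled)
```

In this toy setup the loss is lower when each phonemic embedding sits at the same index as its orthographic counterpart than when the pairs are shuffled, which is the behavior a one-to-one alignment objective is meant to reward.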