Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there have been efforts to align languages in a single latent space to mitigate such gaps, how different input-level representations influence them has not been investigated, particularly for phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between these languages, and we revisit the use of phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks covering 12 languages in total. The results show that phonemic representations exhibit higher similarity between languages than orthographic representations, and that they consistently outperform the grapheme-based baseline model on relatively low-resource languages. This quantitative evidence from the three cross-lingual tasks is further supported by a theoretical analysis of the cross-lingual performance gap.