There are two primary approaches to cross-lingual transfer: multilingual pre-training, which implicitly aligns the hidden representations of different languages, and translate-test, which explicitly translates the input into an intermediate language such as English. Translate-test is more interpretable than multilingual pre-training, but it performs worse (Conneau and Lample, 2019; Conneau et al., 2020) and struggles with word-level tasks because translation alters word order. We therefore propose a new Machine-created Universal Language (MUL) as an alternative intermediate language. MUL consists of a set of discrete symbols that form a universal vocabulary, together with a natural-language-to-MUL translator that converts multiple natural languages into MUL. MUL unifies concepts shared across languages into a single universal word, which enhances cross-lingual transfer. It also retains language-specific words and word order, so the model can be easily applied to word-level tasks. Our experiments show that translating into MUL yields better performance than multilingual pre-training, and our analysis indicates that MUL has strong interpretability. The code is available at https://github.com/microsoft/Unicoder/tree/master/MCUL.
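To make the two properties described above concrete (shared concepts collapse to one universal word, while language-specific words and word order are kept), here is a minimal toy sketch. It is not the authors' translator: the lexicon `TOY_LEXICON`, the symbol names such as `U_017`, and the function `to_mul` are hypothetical, and a static lookup table stands in for whatever learned model the paper actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (language, word) -> universal symbol.
# The symbol IDs are invented for illustration only.
TOY_LEXICON: Dict[Tuple[str, str], str] = {
    ("en", "cat"): "U_017",
    ("fr", "chat"): "U_017",
    ("de", "katze"): "U_017",
    ("en", "sleeps"): "U_042",
    ("fr", "dort"): "U_042",
    ("de", "schläft"): "U_042",
}


def to_mul(tokens: List[str], lang: str) -> List[str]:
    """Map each token to a symbol, one-to-one and in the original order."""
    out: List[str] = []
    for tok in tokens:
        word = tok.lower()
        # Words without a shared concept are kept as language-specific symbols.
        out.append(TOY_LEXICON.get((lang, word), f"{lang}:{word}"))
    return out


if __name__ == "__main__":
    print(to_mul(["The", "cat", "sleeps"], "en"))
    print(to_mul(["Le", "chat", "dort"], "fr"))
```

Because the mapping is one symbol per input token, position `i` in the output still corresponds to position `i` in the input, which is the property that makes word-level tasks such as tagging straightforward in this toy setting.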