In this paper, we introduce UniBridge (Cross-Lingual Transfer Learning with Optimized Embeddings and Vocabulary), a comprehensive approach for improving the effectiveness of cross-lingual transfer learning, particularly for low-resource languages. Our approach addresses two essential components of a language model: the initialization of embeddings and the choice of vocabulary size. Specifically, we propose a novel embedding initialization method that leverages both the lexical and the semantic alignment of a language. In addition, we present a method for systematically searching for the optimal vocabulary size, balancing model complexity against linguistic coverage. Our experiments on multilingual datasets show that our approach substantially improves the F1 score in several languages. UniBridge is a robust and adaptable solution for cross-lingual systems across a range of languages, and it highlights the importance of embedding initialization and vocabulary size selection in cross-lingual settings.