We study the selection of transfer languages for three Natural Language Processing tasks: sentiment analysis, named entity recognition, and dependency parsing. To select an optimal transfer language, we propose using linguistic similarity metrics to measure the distance between languages and choosing the transfer language based on this information rather than on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all three tasks. We also show that choosing the optimal language as the transfer source, instead of English, yields a statistically significant difference in performance. This allows us to select a more suitable transfer language, which can be used to better leverage knowledge from high-resource languages and improve the performance of language applications that lack data. For the study, we used datasets in eight languages from three language families.
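As a rough illustration of the selection step described above, the sketch below ranks candidate transfer languages by their distance to a target language. It is a minimal sketch under stated assumptions, not the procedure used in the study: the abstract does not name the similarity metrics, so cosine distance over binary feature vectors is swapped in here, and the language names (`lang_a` to `lang_d`), the feature values, and the helper `rank_transfer_languages` are all hypothetical placeholders.

```python
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Treat an all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rank_transfer_languages(target, candidates, features):
    """Sort candidate source languages by increasing distance to the target.

    Hypothetical helper: `features` maps a language name to its feature vector.
    """
    target_vec = features[target]
    scored = [
        (lang, cosine_distance(target_vec, features[lang]))
        for lang in candidates
        if lang != target
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy placeholder vectors, NOT real linguistic data (assumption for illustration).
TOY_FEATURES = {
    "lang_a": [1, 0, 1, 1, 0, 1],
    "lang_b": [1, 0, 1, 0, 0, 1],
    "lang_c": [0, 1, 0, 0, 1, 0],
    "lang_d": [1, 1, 0, 1, 0, 0],
}

ranking = rank_transfer_languages("lang_a", list(TOY_FEATURES), TOY_FEATURES)
best_language, best_distance = ranking[0]
print(f"closest candidate to lang_a: {best_language} (distance {best_distance:.3f})")
```

In practice the feature vectors would come from a typological or other linguistic resource, and the closest candidate would be compared against English as a baseline source; both choices are left open here.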