Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection for part-of-speech tagging, considering two different sources of morphosyntactic features. While previous work examines these dynamics in the context of bilingual BiLSTMs, we extend our analysis to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. We train a series of transfer language ranking systems and examine how different feature inputs influence ranker performance across architectures. Word overlap, type-token ratio, and genealogical distance emerge as the top features across all architectures. Our findings reveal that a combination of typological and dataset-dependent features leads to the best rankings, and that good performance can be obtained with either feature group on its own.