Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource source languages, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies on named entity recognition (NER) and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, for a total of 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and not statistically significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not for sentiment analysis.
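To make the budget-constrained formulation concrete, the following is a minimal sketch of one possible allocation rule: an even split of the budget B across the chosen source languages, capped by how many labeled examples each source actually has. It is an illustration only, not the Budget-Xfer implementation. The function names (`allocate_equal`, `utilization`), the source labels, and the dataset sizes are all hypothetical, and the sketch assumes that "budget underutilization" refers to a single source having fewer labeled examples than B.

```python
from typing import Dict


def allocate_equal(budget: int, available: Dict[str, int]) -> Dict[str, int]:
    """Split `budget` evenly across sources, capped by each source's size.

    Budget that a small source cannot absorb is redistributed among the
    sources that still have unused examples. (Hypothetical helper.)
    """
    alloc = {lang: 0 for lang in available}
    remaining = budget
    active = [lang for lang in available if available[lang] > 0]
    while remaining > 0 and active:
        # Give every active source at least one example per round.
        share = max(remaining // len(active), 1)
        for lang in list(active):
            if remaining == 0:
                break
            take = min(share, available[lang] - alloc[lang], remaining)
            alloc[lang] += take
            remaining -= take
            if alloc[lang] == available[lang]:
                active.remove(lang)  # source exhausted
    return alloc


def utilization(budget: int, alloc: Dict[str, int]) -> float:
    """Fraction of the budget that was actually spent."""
    return sum(alloc.values()) / budget


# Illustrative (made-up) numbers of labeled examples per source language.
sources = {"src_a": 3000, "src_b": 8000, "src_c": 5000}
B = 10_000

single = allocate_equal(B, {"src_a": sources["src_a"]})
multi = allocate_equal(B, sources)

print(single, utilization(B, single))
print(multi, utilization(B, multi))
```

Under these assumed sizes, the single-source run can spend only part of the budget because its one source runs out of data, whereas the multi-source run spends all of it. This is one way a data-quantity gap could arise between single-source and multi-source settings even when the nominal budget is identical.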