Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques, despite increasing embedding similarity, frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits of alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS tagging or sentence classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why ``better'' alignment often fails to translate into ``better'' cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
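To make the gradient-based analysis concrete, the following is a minimal sketch of how gradient similarity and gradient magnitude could be computed for an alignment loss and a task loss that share the same parameters. It is not the implementation used in this work: the linear encoder `W`, the squared-distance alignment loss, the squared-error task loss, and all function names and input values are assumptions chosen purely for illustration, with the toy inputs picked so that the two gradients come out exactly orthogonal.

```python
import math

def matvec(W, x):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def outer(a, b):
    # Outer product a b^T as a list of rows.
    return [[ai * bj for bj in b] for ai in a]

def flatten(M):
    return [v for row in M for v in row]

def alignment_grad(W, x_src, x_tgt):
    # Assumed loss: L_align = 0.5 * ||W x_src - W x_tgt||^2
    # Gradient w.r.t. W: (W d) d^T, with d = x_src - x_tgt.
    d = [s - t for s, t in zip(x_src, x_tgt)]
    return outer(matvec(W, d), d)

def task_grad(W, v, x, y):
    # Assumed loss: L_task = 0.5 * (v . (W x) - y)^2
    # Gradient w.r.t. W: (v . (W x) - y) * v x^T.
    err = sum(vi * hi for vi, hi in zip(v, matvec(W, x))) - y
    return [[err * g for g in row] for row in outer(v, x)]

def grad_norm(G):
    # Gradient magnitude: L2 norm of the flattened gradient.
    return math.sqrt(sum(g * g for g in flatten(G)))

def grad_cosine(G1, G2):
    # Gradient similarity: cosine between the flattened gradients.
    dot = sum(a * b for a, b in zip(flatten(G1), flatten(G2)))
    denom = grad_norm(G1) * grad_norm(G2)
    return dot / denom if denom > 0 else 0.0

# Toy values (hypothetical): identity encoder, one translation pair,
# and one labeled task example.
W = [[1.0, 0.0], [0.0, 1.0]]
g_align = alignment_grad(W, x_src=[1.0, 0.0], x_tgt=[0.0, 1.0])
g_task = task_grad(W, v=[1.0, 2.0], x=[1.0, 1.0], y=0.0)

print("cosine:", grad_cosine(g_align, g_task))
print("norms:", grad_norm(g_align), grad_norm(g_task))
```

In this toy setting the cosine is zero even though the alignment gradient is nonzero: a step along the alignment gradient reduces the distance between the paired embeddings but, to first order, leaves the task loss unchanged. With a real encoder, the same two quantities would be computed from the flattened parameter gradients returned by an automatic differentiation framework.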