Massively multilingual language models have displayed strong performance in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer setups, where models fine-tuned on task data in a source language are transferred to the target language(s) with no annotated target-language instances or only a few. However, current work typically overestimates model performance, because fine-tuned models are frequently evaluated at the checkpoints that generalize best to validation instances in the target languages. This effectively violates the main assumptions of "true" ZS-XLT and FS-XLT. Such XLT setups require robust methods that do not depend on labeled target-language data for validation and model selection. In this work, aiming to improve the robustness of "true" ZS-XLT and FS-XLT, we propose a simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning. We conduct exhaustive ZS-XLT and FS-XLT experiments across higher-level semantic tasks (NLI, extractive QA) and lower-level token classification tasks (NER, POS). The results indicate that averaging model checkpoints yields systematic and consistent performance gains across diverse target languages in all tasks. Importantly, it also makes XLT substantially less sensitive to hyperparameter choices when no target-language validation data is available. We also show that checkpoint averaging further benefits performance when combined with run averaging (i.e., averaging the parameters of models fine-tuned over independent runs).
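The two averaging operations named above (checkpoint averaging within one run, and run averaging across independent runs) can be illustrated with a minimal sketch. It assumes a uniform, unweighted mean over snapshots, which the abstract does not specify, and it represents each snapshot as a plain dictionary of parameter name to a flat list of floats. The `Params` type, the function name `average_checkpoints`, and the toy values are hypothetical and chosen only for illustration; they are not the authors' implementation.

```python
from typing import Dict, List, Sequence

# Hypothetical flat representation of a model snapshot:
# parameter name -> list of parameter values.
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params]) -> Params:
    """Return the element-wise uniform mean of several parameter dicts.

    Assumption: all snapshots share the same parameter names and sizes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    first = checkpoints[0]
    for ckpt in checkpoints:
        if set(ckpt) != set(first):
            raise ValueError("checkpoints have different parameter names")
    averaged: Params = {}
    for name, values in first.items():
        if any(len(ckpt[name]) != len(values) for ckpt in checkpoints):
            raise ValueError(f"size mismatch for parameter {name!r}")
        averaged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / len(checkpoints)
            for i in range(len(values))
        ]
    return averaged


# Toy snapshots taken during two independent fine-tuning runs (made-up values).
run_a = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [2.0, 4.0], "b": [1.0]},
    {"w": [3.0, 6.0], "b": [2.0]},
]
run_b = [
    {"w": [0.0, 0.0], "b": [2.0]},
    {"w": [2.0, 2.0], "b": [4.0]},
]

# Checkpoint averaging: one averaged model per run.
ca_a = average_checkpoints(run_a)
ca_b = average_checkpoints(run_b)

# Run averaging on top: average the per-run averaged models.
combined = average_checkpoints([ca_a, ca_b])
```

Because the averaged model is built only from snapshots of the source-language fine-tuning run, no labeled target-language data is needed to choose among checkpoints, which is the property the abstract emphasizes. How often snapshots are saved, and any conditions under which run averaging works in practice, are left out of this sketch.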