Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging

Massively multilingual language models have displayed strong performance in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer setups, where models fine-tuned on task data in a source language are transferred to the target language(s) with no or only a few annotated target-language instances. However, current work typically overestimates model performance, because fine-tuned models are frequently evaluated at the checkpoints that generalize best to validation instances in the target languages. This effectively violates the main assumptions of "true" ZS-XLT and FS-XLT. Such XLT setups require robust methods that do not depend on labeled target-language data for validation and model selection. In this work, aiming to improve the robustness of "true" ZS-XLT and FS-XLT, we propose a simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning. We conduct exhaustive ZS-XLT and FS-XLT experiments across higher-level semantic tasks (NLI, extractive QA) and lower-level token classification tasks (NER, POS). The results indicate that averaging model checkpoints yields systematic and consistent performance gains across diverse target languages in all tasks. Importantly, it also makes XLT substantially less sensitive to hyperparameter choices in the absence of target-language validation. We also show that checkpoint averaging benefits performance when further combined with run averaging (i.e., averaging the parameters of models fine-tuned over independent runs).
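The two averaging steps named in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: snapshots are weighted uniformly, plain Python lists stand in for parameter tensors, and all snapshots share the same parameter names and shapes. The names `Params` and `average_checkpoints` are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat values.
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params]) -> Params:
    """Return the element-wise uniform mean of several parameter snapshots."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Params = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all snapshots.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Checkpoint averaging: toy snapshots saved at three points of one run.
run_a = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [2.0, 4.0], "b": [1.0]},
    {"w": [3.0, 6.0], "b": [2.0]},
]
ckpt_avg_a = average_checkpoints(run_a)

# Run averaging: average the checkpoint-averaged models of independent runs.
run_b = [{"w": [4.0, 0.0], "b": [3.0]}]
run_avg = average_checkpoints([ckpt_avg_a, average_checkpoints(run_b)])
```

Because the average is computed from snapshots of the fine-tuning run itself, no target-language validation data is needed to pick a single "best" checkpoint, which is the property the abstract emphasizes.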

