Word translation, or bilingual lexicon induction (BLI), is a key cross-lingual task, aiming to bridge the lexical gap between different languages. In this work, we propose a robust and effective two-stage contrastive learning framework for the BLI task. In Stage C1, we propose to refine standard cross-lingual linear maps between static word embeddings (WEs) via a contrastive learning objective; we also show how to integrate this objective into a self-learning procedure to obtain even more refined cross-lingual maps. In Stage C2, we conduct BLI-oriented contrastive fine-tuning of mBERT, unlocking its word translation capability. We also show that static WEs induced from the 'C2-tuned' mBERT complement the static WEs from Stage C1. Comprehensive experiments on standard BLI datasets covering diverse languages and different experimental setups demonstrate substantial gains achieved by our framework. While the BLI method from Stage C1 alone already yields substantial gains over all state-of-the-art BLI methods in our comparison, even stronger improvements are achieved with the full two-stage framework: e.g., we report gains for 112/112 BLI setups, spanning 28 language pairs.
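To make the Stage C1 idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): a standard cross-lingual linear map is first obtained via the orthogonal Procrustes solution on seed translation pairs, and an InfoNCE-style contrastive loss with in-batch negatives is then evaluated over the mapped source WEs. All data here is synthetic, and the function and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static word embeddings for n seed translation pairs (rows aligned).
n, d = 32, 16
X = rng.standard_normal((n, d))                  # source-language WEs
T = rng.standard_normal((d, d))
Y = X @ T + 0.05 * rng.standard_normal((n, d))   # noisy target-language WEs

# Standard cross-lingual linear map: orthogonal Procrustes solution
# minimizing ||XW - Y||_F over orthogonal W.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def info_nce(X, Y, W, tau=0.1):
    """InfoNCE-style contrastive loss with in-batch negatives:
    each mapped source WE should be closest to its own translation."""
    A = X @ W
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = (A @ B.T) / tau                      # pairwise cosine similarities
    S = S - S.max(axis=1, keepdims=True)     # numerical stability
    log_p = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))          # correct pairs on the diagonal

loss_mapped = info_nce(X, Y, W)
loss_identity = info_nce(X, Y, np.eye(d))
print(loss_mapped, loss_identity)
```

In a full refinement loop, this loss would be minimized with respect to the map (and interleaved with self-learning over an expanding translation dictionary); the sketch only shows that the Procrustes-initialized map already scores far better under the contrastive objective than an unaligned identity map.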