Existing multilingual embedding models often struggle in cross-lingual scenarios because of imbalanced linguistic resources and insufficient attention to cross-lingual alignment during training. Although standard contrastive learning approaches to cross-lingual adaptation are widely adopted, they may fail to capture the fundamental alignment between languages and can degrade performance in already well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function that uses a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR uses an English passage as a bridge to strengthen the alignment between the target language and English, ensuring robust performance on the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains of up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings show that CLEAR remains effective in multilingual training, suggesting its potential for broad application and scalability. We release the code at https://github.com/dltmddbs100/CLEAR.