Embedding models are crucial to modern NLP, but the most effective models rely on carefully constructed supervised fine-tuning data. For high-resource languages such as English, such datasets are readily available; for hundreds of other languages, they do not exist. We investigate whether large language models (LLMs) can help bridge this gap. We test three strategies for generating the synthetic triplet data used to optimise embedding models: in-context learning and two novel approaches, which leverage adapter composition and cross-lingual fine-tuning of the LLM generator (XL-LoRA), respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide range of tasks and languages, offering a clear, scalable pathway to performant embedding models for a broad set of languages.