In the large language model (LLM) revolution, embeddings are a key component of many systems. For example, they are used to retrieve knowledge or memories for LLMs and to build content moderation filters. Because these use cases span English and other natural or programming languages, and range from retrieval to classification and beyond, it is desirable to build a unified embedding model rather than a dedicated one for each scenario. In this work, we take an initial step towards this goal, demonstrating that transformer decoders pre-trained on multiple languages (both natural and programming) can embed universally when finetuned on limited English data. We describe our practice in detail and evaluate it thoroughly. On English MTEB, our models achieve competitive performance on different embedding tasks with minimal training data. On other benchmarks, such as multilingual classification and code search, our models (without any supervision) perform comparably to, or even surpass, heavily supervised baselines and/or APIs. These results provide evidence of a promising path towards building powerful unified embedders that can be applied across tasks and languages.