Large language models (LLMs) face significant challenges stemming from inherent limitations in knowledge, memory, alignment, and action. LLMs cannot address these challenges alone; they need assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation is a vital mechanism for bridging the gap between LLMs and this external assistance. However, conventional methods encounter two pressing issues. On one hand, general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, task-specific retrievers lack the required versatility, which hinders their performance across diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse needs of LLMs' retrieval augmentation with one unified embedding model. Training such a unified model is non-trivial, as the various retrieval tasks aim to capture distinct semantic relationships and are often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and the use of homogeneous in-batch negative sampling. These optimization strategies contribute to the strong empirical performance of the LLM-Embedder. Notably, it yields substantial improvements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. This project is publicly available at https://github.com/FlagOpen/FlagEmbedding.
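To make three of the training ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "explicit instructions" means a task-specific text prefix, that "homogeneous in-batch negative sampling" means every batch is drawn from a single task so that in-batch negatives share that task, and that distillation compares a softmax over LLM-derived rewards with a softmax over retriever scores. The instruction strings, function names, and temperature parameter are all hypothetical.

```python
import math
import random
from collections import defaultdict

# Hypothetical instruction strings; the abstract does not list the real ones.
INSTRUCTIONS = {
    "qa": "Represent this query for retrieving relevant documents: ",
    "tool": "Represent this request for retrieving a useful tool: ",
}


def add_instruction(task, text):
    """Prefix a text with the instruction of its task."""
    return INSTRUCTIONS[task] + text


def homogeneous_batches(examples, batch_size, seed=0):
    """Group (task, query, positive) tuples into single-task batches.

    Because each batch holds one task only, the other positives in the
    batch act as negatives from the same task (an assumed reading of
    "homogeneous in-batch negative sampling").
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task, query, positive in examples:
        by_task[task].append((add_instruction(task, query), positive))
    batches = []
    for task, items in by_task.items():
        rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            batches.append((task, items[start:start + batch_size]))
    rng.shuffle(batches)  # interleave tasks across training steps
    return batches


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(rewards, scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate list.

    `rewards` stand in for LLM feedback on each candidate and `scores`
    for the retriever's similarities; both are assumptions made here.
    """
    teacher = softmax(rewards, temperature)
    student = softmax(scores, temperature)
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(teacher, student))
```

In this sketch the loss is zero when the retriever ranks candidates exactly as the LLM rewards do, and it grows as the two distributions diverge.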