Towards Robust Text Retrieval with Progressive Learning

Retrieval augmentation has become an effective way to equip large language models (LLMs) with external, verified knowledge from a database, which mitigates the limitations and hallucinations of LLMs when handling up-to-date and domain-specific information. However, existing embedding models for text retrieval usually have three non-negligible limitations. First, the number and diversity of samples in a batch are too restricted to supervise the modeling of textual nuances at scale. Second, a high proportion of noise is detrimental to the semantic correctness and consistency of embeddings. Third, treating easy and difficult samples equally causes suboptimal convergence of embeddings and poorer generalization. In this paper, we propose PEG, progressively learned embeddings for robust text retrieval. Specifically, we increase the number of in-batch negative samples used in training to 80,000, and for each query we extract five hard negatives. Concurrently, we incorporate a progressive learning mechanism that enables the model to dynamically modulate its attention to the samples throughout the entire training process. Additionally, PEG is trained on more than 100 million data samples, encompassing a wide range of domains (e.g., finance, medicine, and tourism) and covering various tasks (e.g., question answering, machine reading comprehension, and similarity matching). Extensive experiments conducted on C-MTEB and DuReader demonstrate that PEG surpasses state-of-the-art embeddings in retrieving true positives, highlighting its significant potential for applications in LLMs. Our model is publicly available at https://huggingface.co/TownsWu/PEG.
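To make the training recipe above more concrete, the following is a minimal sketch of how in-batch negatives, per-query hard negatives, and a difficulty-dependent sample weighting could fit together. The abstract does not state the loss or the weighting schedule, so everything here is an assumption: the InfoNCE-style contrastive loss, the temperature of 0.05, the softmax-based weighting, and the names `info_nce_per_query` and `progressive_weights` are hypothetical and are not taken from PEG. The toy batch has two queries and one hard negative each, rather than 80,000 in-batch negatives and five hard negatives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_per_query(queries, positives, hard_negatives, temperature=0.05):
    """Return one contrastive loss per query (assumed InfoNCE-style loss).

    For query i, the candidates are every positive in the batch (the
    positives of the other queries act as in-batch negatives) plus the
    hard negatives mined for query i. The target is positives[i].
    """
    losses = []
    for i, q in enumerate(queries):
        candidates = positives + hard_negatives[i]
        logits = [dot(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return losses


def progressive_weights(losses, progress):
    """Hypothetical schedule: softmax over (progress * loss), scaled to sum to n.

    progress = 0 gives uniform weights of 1.0; larger values shift weight
    toward samples with higher loss, i.e. the more difficult ones.
    """
    scaled = [progress * l for l in losses]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    n = len(losses)
    return [n * e / total for e in exps]


# Toy batch of unit-length 2-D embeddings (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[[0.0, -1.0]], [[0.6, 0.8]]]  # one hard negative per query

losses = info_nce_per_query(queries, positives, hard_negatives)
weights = progressive_weights(losses, progress=5.0)
weighted_loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

In this toy batch the second query has a hard negative close to its positive, so its loss is larger and it receives the larger weight once `progress` is above zero. How PEG actually schedules attention over samples is not specified in the abstract, and the sketch should not be read as its implementation.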
