This report explores how advanced data refinement techniques can improve text retrieval performance. We develop Linq-Embed-Mistral\footnote{\url{https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral}} by building on the E5-mistral and Mistral-7B-v0.1 models, focusing on data crafting, data filtering, and negative mining methods tailored to each task. These methods are applied both to existing benchmark datasets and to task-specific synthetic datasets generated with large language models (LLMs). Linq-Embed-Mistral performs strongly on the MTEB benchmark (as of May 29, 2024): it achieves an average score of 68.2 across 56 datasets and ranks 1st among all models on the retrieval tasks of the MTEB leaderboard, with a score of 60.2. These results indicate strong search precision and reliability. Our contributions are: (1) advanced data refinement methods that significantly improve model performance on benchmark and synthetic datasets; (2) homogeneous task ordering and mixed task fine-tuning techniques that enhance model generalization and stability; and (3) a streamlined evaluation process using 4-bit precision and a light retrieval evaluation set, which accelerates validation without sacrificing accuracy.
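The abstract names negative mining without describing the procedure. As a generic illustration only, and not the authors' method, the following minimal sketch shows one common form of hard negative mining with false-negative filtering: candidates are ranked by similarity to the query, and any candidate scoring too close to the known positive is dropped as a likely unlabeled positive. The function names (`cosine`, `mine_hard_negatives`) and the parameters `k` and `margin` are hypothetical and chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query, positive, candidates, k=2, margin=0.05):
    """Return indices of up to k hard negatives for one query.

    Assumption for this sketch: a candidate whose similarity to the query is
    not at least `margin` below the positive's similarity is treated as a
    likely false negative and skipped. The remaining candidates are ranked
    from most to least similar, and the top k are kept.
    """
    pos_score = cosine(query, positive)
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    kept = [(s, i) for s, i in scored if s < pos_score - margin]
    kept.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in kept[:k]]


# Toy 2-D embeddings; real embeddings would come from the embedding model.
query = [1.0, 0.0]
positive = [0.9, 0.1]
candidates = [
    [1.0, 0.05],   # nearly identical to the query: filtered as a likely false negative
    [0.7, 0.7],
    [0.0, 1.0],
    [0.8, 0.6],
    [-1.0, 0.0],
]
hard_negatives = mine_hard_negatives(query, positive, candidates, k=2)
```

In this toy setup the first candidate is excluded by the margin filter, and the two most similar remaining candidates are selected. The margin and `k` would need tuning per task, which is consistent with the task-specific tailoring the abstract describes but is not taken from it.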