In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work makes three key technical contributions. First, we propose ReMixer, a new data synthesis method that overcomes the triviality problem prevalent in previous synthetic datasets, enabling the production of 82K high-quality training samples at scale. Second, we design Redapter, a self-adaptive learning algorithm that dynamically adjusts each training sample's weight based on its reasoning intensity. This allows the model to effectively capture the complex semantic relationships between queries and documents. Third, we implement ReasonEmbed across multiple backbones of varying sizes, all of which achieve superior performance on reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model achieves a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, significantly outperforming existing text embedding models. We will fully open-source the resources created for ReasonEmbed to advance research in this field.
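
For readers unfamiliar with the nDCG@10 metric cited above, the following is a minimal sketch of how it can be computed for a single query. It is not the evaluation code used for BRIGHT or ReasonEmbed. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, the linear-gain form of DCG is assumed, and the input is assumed to hold the relevance labels of every judged document in ranked order, so that the ideal ranking can be derived from the same list.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # The document at 0-based rank r is discounted by log2(r + 2),
    # so the top-ranked document has a discount of log2(2) = 1.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k for one query.

    Assumption: ranked_relevances lists the relevance label of every
    judged document, in the order the retriever ranked them.
    """
    # The ideal ranking places the most relevant documents first.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    # Hypothetical query: the only relevant document is ranked second.
    print(round(ndcg_at_k([0, 1, 0, 0]), 4))
```

A benchmark-level score is the mean of this per-query value. Since nDCG lies in [0, 1], a reported score such as 38.1 corresponds to the mean multiplied by 100.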