We present systematic efforts in building a long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained with a native 8192-token context (longer than the 512 tokens of previous multilingual encoders). We then construct a hybrid TRM and a cross-encoder reranker via contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art, XLM-R. Meanwhile, our TRM and reranker match the performance of the larger state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit a wide range of research and industrial applications.