Embedding models play a pivotal role in modern NLP applications such as information retrieval (IR) and retrieval-augmented generation (RAG). While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models remain confined to narrow context windows not exceeding 8k tokens, barring them from application scenarios that require long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models on long-context retrieval using our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore substantial room for improvement in these models. Building on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k tokens. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior on short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
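The core idea behind position interpolation, one of the training-free strategies mentioned above, can be sketched in a few lines: token positions of a long input are linearly rescaled to fit within the model's trained position range, so the model never sees an out-of-range index. The helper name and window sizes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def interpolate_positions(num_tokens: int, original_window: int) -> np.ndarray:
    """Linearly rescale positions 0..num_tokens-1 into the trained
    range [0, original_window). Hypothetical helper for illustration."""
    scale = original_window / num_tokens
    return np.arange(num_tokens) * scale

# Example: feed a 4096-token input to a model trained on 512 positions.
scaled = interpolate_positions(4096, 512)
# Every rescaled position now lies strictly below 512, the trained limit.
assert scaled.max() < 512
```

Because the rescaled positions are fractional, APE models interpolate between learned position embeddings, while RoPE models simply evaluate their rotation at the fractional angle, which is one intuition for why RoPE extends more gracefully.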