LongEmbed: Extending Embedding Models for Long Context Retrieval

Embedding models play a pivotal role in modern NLP applications such as information retrieval (IR) and retrieval-augmented generation (RAG). While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which keeps them out of application scenarios that require long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results show substantial room for improvement in these models. Building on this, comprehensive experiments show that training-free context window extension strategies such as position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 tokens or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show that further fine-tuning can yield notable performance gains while strictly preserving the original behavior for short inputs. For models using rotary position embedding (RoPE), significant improvements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
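
To make the training-free strategies named above more concrete, here is a minimal sketch of two of them: position interpolation, which rescales position indices so that a longer input maps back into the original position range, and NTK-aware scaling, which enlarges the RoPE base instead. The function names, the default base of 10000, and the head dimension of 64 are assumptions chosen for illustration, and the NTK formula is the commonly cited variant; none of this is taken from the paper or is necessarily the exact formulation it uses. How the resulting fractional positions are consumed depends on the position encoding and is not shown.

```python
import math


def interpolated_positions(seq_len: int, original_max: int) -> list[float]:
    """Position interpolation: squeeze seq_len positions into [0, original_max).

    Inputs that already fit are left untouched (scale = 1.0).
    """
    scale = min(1.0, original_max / seq_len)
    return [i * scale for i in range(seq_len)]


def rope_inverse_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2k/d) for k = 0 .. d/2 - 1."""
    return [base ** (-2 * k / head_dim) for k in range(head_dim // 2)]


def ntk_scaled_base(base: float, head_dim: int, factor: float) -> float:
    """NTK-aware scaling (commonly cited form): base * factor^(d / (d - 2)).

    The highest frequency stays unchanged, while the lowest frequency is
    divided by `factor`, matching plain interpolation at that end.
    """
    return base * factor ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    # Hypothetical setting: a 512-token model fed a 2048-token input (factor 4).
    positions = interpolated_positions(seq_len=2048, original_max=512)
    print("largest interpolated position:", max(positions))

    new_base = ntk_scaled_base(base=10000.0, head_dim=64, factor=4.0)
    print("NTK-scaled RoPE base:", new_base)
```

Position interpolation compresses every position uniformly, whereas NTK-aware scaling leaves the high-frequency dimensions intact and stretches only the low-frequency ones, which is one reason RoPE-specific methods have more room to work with than a uniform rescaling.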
