LongEmbed: Extending Embedding Models for Long Context Retrieval

Embedding models play a pivotal role in modern NLP applications such as information retrieval (IR) and retrieval-augmented generation (RAG). While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window of at most 8k tokens, which keeps them out of application scenarios that require long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results show that these models leave substantial room for improvement. Building on this, comprehensive experiments show that training-free context window extension strategies such as position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 tokens or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show that further fine-tuning can yield notable performance gains while strictly preserving the original behavior for short inputs. For models using rotary position embedding (RoPE), significant improvements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
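As a rough illustration of the two families of training-free strategies named above, the following sketch shows (a) linear position interpolation applied to a learned APE table and (b) NTK-aware rescaling of the RoPE base. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names, the table shapes, and the specific NTK formula base' = base * s^(d / (d - 2)) are a commonly used formulation chosen here for illustration.

```python
import numpy as np


def interpolate_positions(pos_emb: np.ndarray, scale: int) -> np.ndarray:
    """Stretch a learned (L, d) position table to (L * scale, d) by linear interpolation.

    Hypothetical helper: new position i reads from fractional source position i / scale.
    """
    orig_len, _ = pos_emb.shape
    new_len = orig_len * scale
    src = np.arange(new_len) / scale
    # Clamp the tail so positions past the last learned row reuse that row.
    src = np.minimum(src, orig_len - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, orig_len - 1)
    frac = (src - lo)[:, None]
    return (1.0 - frac) * pos_emb[lo] + frac * pos_emb[hi]


def rope_inv_freq(dim: int, base: float = 10000.0, scale: float = 1.0) -> np.ndarray:
    """Inverse RoPE frequencies, optionally with NTK-aware base rescaling (assumed formula)."""
    # scale == 1.0 leaves the base unchanged and gives the standard RoPE frequencies.
    ntk_base = base * scale ** (dim / (dim - 2))
    return 1.0 / ntk_base ** (np.arange(0, dim, 2) / dim)


# Example: extend a 512-position APE table eightfold, to 4096 positions.
rng = np.random.default_rng(0)
table = rng.normal(size=(512, 16))
extended = interpolate_positions(table, scale=8)

# Example: RoPE frequencies for a 64-dimensional head, before and after NTK scaling.
plain = rope_inv_freq(64)
ntk = rope_inv_freq(64, scale=8.0)
```

In this sketch the interpolated table reproduces the original rows at every eighth position, so each original position keeps its embedding at a rescaled index. The NTK variant leaves the highest frequency untouched and divides the lowest frequency by the scale factor, which spreads the extension unevenly across dimensions instead of compressing all of them uniformly.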
