Embedding models play a pivotal role in modern NLP applications such as information retrieval (IR) and retrieval-augmented generation (RAG). While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models remain confined to narrow context windows not exceeding 8k tokens, barring them from application scenarios that require long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models on long-context retrieval using our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore substantial room for improvement in these models. Building on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k tokens. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior on short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
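The core idea behind position interpolation, one of the training-free strategies mentioned above, can be sketched in a few lines: token positions of a long input are linearly rescaled to fit within the model's trained position range, so the model never sees an out-of-range index. The helper name and window sizes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def interpolate_positions(num_tokens: int, original_window: int) -> np.ndarray:
    """Linearly rescale positions 0..num_tokens-1 into the trained
    range [0, original_window). Hypothetical helper for illustration."""
    scale = original_window / num_tokens
    return np.arange(num_tokens) * scale

# Example: feed a 4096-token input to a model trained on 512 positions.
scaled = interpolate_positions(4096, 512)
# Every rescaled position now lies strictly below 512, the trained limit.
assert scaled.max() < 512
```

Because the rescaled positions are fractional, APE models interpolate between learned position embeddings, while RoPE models simply evaluate their rotation at the fractional angle, which is one intuition for why RoPE extends more gracefully.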