When Text Embedding Meets Large Language Model: A Comprehensive Survey

Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled with generative paradigms that leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications, such as semantic matching, clustering, and information retrieval, continue to rely on text embeddings for their efficiency and effectiveness. Integrating LLMs with text embeddings has therefore become a major research focus in recent years. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, which enhances traditional embedding methods with LLMs; (2) LLMs as text embedders, which adapts their innate capabilities to produce high-quality embeddings; and (3) text embedding understanding with LLMs, which leverages LLMs to analyze and interpret embeddings. By organizing recent works by interaction pattern rather than by specific downstream application, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persist from the pre-LLM era of pre-trained language models (PLMs) and explore the emerging obstacles introduced by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.

