Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled with generative paradigms that leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications, such as semantic matching, clustering, and information retrieval, continue to rely on text embeddings for their efficiency and effectiveness. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, which enhances traditional embedding methods with LLMs; (2) LLMs as text embedders, which uses their innate capabilities for embedding generation; and (3) text embedding understanding with LLMs, which leverages LLMs to analyze and interpret embeddings. By organizing these efforts by interaction pattern rather than by specific downstream application, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that have persisted since the pre-LLM era of pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.