Information retrieval involves selecting the artifacts from a corpus that are most relevant to a given search query. The flavor of retrieval typically used in classical applications can be termed homogeneous and relaxed: queries and corpus elements are both natural language (NL) utterances (homogeneous), and the goal is to place the most relevant corpus elements in the Top-K, where K is large, such as 10, 25, 50, or even 100 (relaxed). Recently, retrieval has been used extensively to prepare prompts for large language models (LLMs) so that they can perform targeted tasks. These new applications of retrieval are often heterogeneous and strict: the queries and the corpus contain different kinds of entities, such as NL and code, and retrieval must improve at Top-K for small values of K, such as K = 1, 3, or 5. Current dense retrieval techniques based on pretrained embeddings provide a general-purpose and powerful approach to retrieval, but they are oblivious to task-specific notions of similarity between heterogeneous artifacts. We introduce Adapted Dense Retrieval, a mechanism that transforms embeddings to enable improved task-specific, heterogeneous, and strict retrieval. Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding. We empirically validate our approach by showing improvements over the state-of-the-art general-purpose embeddings-based baseline.
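To make the idea of a low-rank residual adaptation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the adapted embedding takes the form e'(x) = e(x) + U D e(x), with D of shape (rank, dim) and U of shape (dim, rank), and it assumes cosine similarity for the Top-K ranking; the abstract states neither. The names `make_adapter`, `adapt`, and `top_k` are hypothetical, and the training of the adapter is omitted.

```python
import numpy as np


def make_adapter(dim, rank, seed=0):
    """Create hypothetical low-rank factors for a residual adapter.

    The up-projection starts at zero, so the adapter initially leaves
    the pretrained embeddings unchanged (an assumed initialization).
    """
    rng = np.random.default_rng(seed)
    down = rng.normal(scale=0.01, size=(rank, dim))  # D: dim -> rank
    up = np.zeros((dim, rank))                       # U: rank -> dim
    return down, up


def adapt(emb, down, up):
    """Return emb + U D emb for row-vector embeddings of shape (n, dim) or (dim,)."""
    return emb + emb @ down.T @ up.T


def top_k(query_emb, corpus_emb, k):
    """Return indices of the k corpus rows most similar to the query (cosine)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    dim, rank = 8, 2
    rng = np.random.default_rng(1)
    # Stand-ins for black-box pretrained embeddings of a corpus and a query.
    corpus = rng.normal(size=(5, dim))
    query = rng.normal(size=dim)

    down, up = make_adapter(dim, rank)
    # Strict retrieval: only the single best match is kept (K = 1).
    best = top_k(adapt(query, down, up), adapt(corpus, down, up), k=1)
    print(best)
```

In this sketch the pretrained embedding is treated as a black box: only the two small factors would be learned from task-specific (query, artifact) pairs. Whether the adaptation is applied to queries, to corpus elements, or to both is a design choice that the abstract does not settle.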