Identifying texts with a given semantics is central to many information-seeking scenarios. Similarity search over vector embeddings appears to be key to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of \emph{description-based similarity}. We demonstrate the inadequacy of current text embeddings and propose an alternative model that performs significantly better when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced by prompting an LLM, demonstrating how data from LLMs can be used to create new capabilities not immediately possible using the original model.