Identifying texts with a given semantics is central to many information-seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, making it inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of \emph{description-based similarity}. We demonstrate the inadequacy of current text embeddings for this task and propose an alternative model that substantially improves retrieval quality when used with standard nearest-neighbor search. The model is trained on positive and negative pairs sourced by prompting an LLM, demonstrating how LLM-generated data can be used to create new capabilities not immediately available in the original model.
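To make the setting concrete, the following is a minimal sketch of the retrieval setup the abstract refers to: documents and an abstract description of the desired content are embedded, and retrieval is standard nearest-neighbor search over the resulting vectors. The encoder checkpoint and the toy corpus below are illustrative assumptions for a generic off-the-shelf bi-encoder, not the paper's proposed model or data.

\begin{verbatim}
# A minimal sketch of description-based retrieval via nearest-neighbor
# search. The checkpoint "sentence-transformers/all-MiniLM-L6-v2" is an
# illustrative off-the-shelf choice, not the model proposed here.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "The cat sat quietly on the warm windowsill, watching the rain.",
    "Stocks tumbled on Friday as investors weighed inflation data.",
    "A goalkeeper made three stunning saves in the final minutes.",
]

# An abstract description of the content we want, not a keyword query.
description = "A text about an animal observing the weather indoors."

# Encode and L2-normalize so the dot product equals cosine similarity.
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)
query_vec = encoder.encode([description], normalize_embeddings=True)[0]

# Standard nearest-neighbor search: rank documents by cosine similarity.
scores = doc_vecs @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
\end{verbatim}

Under this setup, the claim in the abstract is that off-the-shelf embeddings rank such abstract descriptions poorly, whereas an encoder contrastively trained on LLM-sourced positive and negative (description, text) pairs would rank the matching document first.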