Vector embeddings have been applied to an ever-increasing set of retrieval tasks over the years, with a recent rise in their use for reasoning, instruction following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance they might be given. While prior work has pointed out theoretical limitations of vector embeddings, it is commonly assumed that these difficulties arise only from unrealistic queries, and that any remaining ones can be overcome with better training data and larger models. In this work, we demonstrate that these theoretical limitations can be encountered in realistic settings with extremely simple queries. Drawing on known results from learning theory, we show that the number of top-k subsets of documents that can be returned as the result of some query is bounded by the dimension of the embedding. We empirically show that this holds true even if we directly optimize on the test set with free parameterized embeddings. Using free embeddings, we then demonstrate that returning all pairs of documents requires a relatively high dimension. Based on these theoretical results, we create a realistic dataset called LIMIT that stress tests embedding models, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single-vector paradigm and calls for future research to develop new techniques that can resolve this fundamental limitation.
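The free-embedding experiment mentioned above can be sketched in a few lines: instead of an encoder, the query and document vectors themselves are optimized directly against the test set, which upper-bounds what any trained model could achieve at that dimension. The following is a minimal toy sketch, not the paper's actual setup: the sizes (8 documents, all 28 pairs as top-2 targets), the pairwise hinge loss, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs = 8
# Every 2-subset of the 8 documents is the relevant set of one query: C(8,2) = 28
pairs = [(a, b) for a in range(n_docs) for b in range(a + 1, n_docs)]
n_queries = len(pairs)
R = np.zeros((n_queries, n_docs))  # binary relevance matrix
for i, (a, b) in enumerate(pairs):
    R[i, a] = R[i, b] = 1.0

def fit_free_embeddings(d, steps=4000, lr=0.05, margin=0.1):
    """Directly optimize free query/doc vectors of dimension d on the test set."""
    Q = 0.1 * rng.normal(size=(n_queries, d))
    D = 0.1 * rng.normal(size=(n_docs, d))
    for _ in range(steps):
        S = Q @ D.T                                    # similarity scores
        # hinge loss: every relevant doc r should beat every irrelevant doc i
        diff = S[:, :, None] - S[:, None, :]           # S[q, r] - S[q, i]
        rel_irrel = R[:, :, None] * (1 - R)[:, None, :]
        viol = rel_irrel * (diff < margin)             # violated (r, i) pairs
        # gradient of the loss w.r.t. S: push irrelevant down, relevant up
        gS = viol.sum(axis=1) - viol.sum(axis=2)
        gQ, gD = gS @ D, gS.T @ Q
        Q -= lr * gQ / n_queries
        D -= lr * gD / n_queries
    # accuracy: fraction of queries whose top-2 docs are exactly the relevant pair
    top2 = np.argsort(-(Q @ D.T), axis=1)[:, :2]
    return float(np.mean([set(t) == {a, b} for t, (a, b) in zip(top2, pairs)]))

acc_low = fit_free_embeddings(d=2)
acc_high = fit_free_embeddings(d=8)
print(f"top-2 accuracy: d=2 -> {acc_low:.2f}, d=8 -> {acc_high:.2f}")
```

Because the vectors are unconstrained, any failure here reflects the geometry of the embedding space rather than model capacity or training data; a low dimension cannot realize all 28 top-2 subsets, while a sufficiently high one can.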