This work reports evidence of topological signatures of ambiguity in sentence embeddings that could be leveraged for ranking and/or explanation in vector search and Retrieval Augmented Generation (RAG) systems. We propose a working definition of ambiguity and design an experiment in which a proprietary dataset is broken down into collections of chunks of varying size (3, 5, and 10 lines), with the different collections used successively as query sets and answer sets. This design allows us to test for signatures of ambiguity while removing confounding factors. Our results show that proxy ambiguous queries (size 10 queries against size 3 documents) display different distributions of features based on homology in dimensions 0 and 1 than proxy clear queries (size 5 queries against size 10 documents). We then discuss these results in terms of increased manifold complexity and/or approximately discontinuous embedding submanifolds. Finally, we propose a strategy to leverage these findings as a new scoring strategy for semantic similarity.
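As a rough illustration of the two ingredients mentioned above (fixed-size chunking and dimension-0 homology features), the following minimal sketch may help. It is not the authors' pipeline: the function names `chunk_lines` and `h0_death_times` are hypothetical, the 2D points are a toy stand-in for sentence embeddings, and the Vietoris-Rips filtration is an assumed choice. It relies on the standard fact that the finite dimension-0 persistence bars of a Vietoris-Rips filtration die at the edge lengths of a minimum spanning tree. Dimension-1 features would require a dedicated persistent homology library and are not shown.

```python
import math
from itertools import combinations


def chunk_lines(lines, size):
    """Split a list of lines into consecutive chunks of `size` lines.

    Hypothetical helper: the last chunk may be shorter than `size`.
    """
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def h0_death_times(points):
    """Death times of the finite H0 bars of a Vietoris-Rips filtration.

    Every connected component is born at scale 0. A component dies when an
    edge first merges it into another one, so the death times are the edge
    lengths of a minimum spanning tree (Kruskal's algorithm, union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        # Find the root of i, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, sorted from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge the two components
            deaths.append(length)    # one component dies at this scale
    return deaths


# Toy stand-in for embeddings (assumption): two tight clusters far apart.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
deaths = h0_death_times(toy_points)
chunks = chunk_lines(["l1", "l2", "l3", "l4", "l5", "l6", "l7"], 3)
```

In this toy case the single large death time reflects the gap between the two clusters. Summary statistics of such death times (for example their maximum or mean) are one plausible kind of dimension-0 feature whose distribution could be compared between query settings.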