We studied how the local topological structure of sentence-embedding neighborhoods encodes semantic ambiguity. Extending ideas that link word-level polysemy to non-trivial persistent homology, we generalized the concept to full sentences and quantified the ambiguity of a query in a semantic search process with two persistent homology metrics: the 1-Wasserstein norm of $H_{0}$ and the maximum loop lifetime of $H_{1}$. We formalized ambiguity as the relative presence of semantic domains, or topics, in a sentence. We then used this formalism to run "ab-initio" simulations that encode data points as linear combinations of randomly generated single-topic vectors in an arbitrary embedding space, and demonstrated that ambiguous sentences separate from unambiguous ones in both metrics. Finally, we validated these findings on a real-world case: a fully open corpus of Nobel Prize in Physics lectures from 1901 to 2024, segmented into contiguous, non-overlapping chunks at two granularities, $\sim\!250$ tokens and $\sim\!750$ tokens. We embedded the chunks with four publicly available models. Results across all models reproduce the simulations and remain stable despite changes in embedding architecture. We conclude that persistent homology provides a model-agnostic signal of semantic discontinuities, suggesting practical use for ambiguity detection and semantic search recall.
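The simulation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the topic vectors are orthonormalized with a QR decomposition, chunks are single-topic vectors plus Gaussian noise, the query neighborhood is the set of `k` nearest chunks in Euclidean distance, and the $H_{0}$ metric is taken as the sum of $H_{0}$ lifetimes of a Vietoris-Rips filtration (which equals the total edge length of a minimum spanning tree). All function names are hypothetical. The $H_{1}$ loop lifetime is not computed here, since it would require a dedicated persistent homology library.

```python
import numpy as np


def h0_total_persistence(points):
    """Sum of finite H0 lifetimes of the Vietoris-Rips filtration.

    Every H0 class is born at 0 and dies at the length of a minimum
    spanning tree edge, so the sum of lifetimes equals the MST weight.
    The MST is built with Prim's algorithm on the full distance matrix.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if n < 2:
        return 0.0
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()  # cheapest link from the current tree to each vertex
    total = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf  # never re-select a vertex already in the tree
        j = int(np.argmin(best))
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return float(total)


def make_corpus(topics, per_topic, noise, rng):
    """Assumed corpus model: each chunk is one topic vector plus Gaussian noise."""
    dim = topics.shape[1]
    return np.vstack(
        [t + noise * rng.standard_normal((per_topic, dim)) for t in topics]
    )


def neighborhood(query, corpus, k):
    """Return the k corpus points closest to the query (Euclidean distance)."""
    d = np.linalg.norm(corpus - query, axis=1)
    return corpus[np.argsort(d)[:k]]


rng = np.random.default_rng(0)
dim, n_topics, k = 32, 4, 20

# Random topic directions, orthonormalized (an assumption of this sketch).
topics = np.linalg.qr(rng.standard_normal((dim, n_topics)))[0].T
corpus = make_corpus(topics, per_topic=50, noise=0.01, rng=rng)

# Queries are linear combinations of topic vectors.
clear_query = np.array([1.0, 0.0, 0.0, 0.0]) @ topics      # one topic
ambiguous_query = np.array([0.5, 0.5, 0.0, 0.0]) @ topics  # two topics mixed

h0_clear = h0_total_persistence(neighborhood(clear_query, corpus, k))
h0_ambiguous = h0_total_persistence(neighborhood(ambiguous_query, corpus, k))

print(f"H0 total persistence, unambiguous query: {h0_clear:.3f}")
print(f"H0 total persistence, ambiguous query:   {h0_ambiguous:.3f}")
```

In this toy setting the neighborhood of the mixed query straddles two topic clusters, so its spanning tree must contain a long bridging edge, and the $H_{0}$ total persistence is larger than for the single-topic query.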