We present Geodesic Semantic Search (GSS), a retrieval system that learns node-specific Riemannian metrics on citation graphs to enable geometry-aware semantic search. Unlike standard embedding-based retrieval, which relies on a fixed Euclidean distance, GSS learns a low-rank factor $\mL_i \in \R^{d \times r}$ at each node, inducing a local metric tensor $\mG_i = \mL_i \mL_i^\top + \eps \mI$ that is positive definite for any $\eps > 0$. This parameterization guarantees a valid metric while keeping the model tractable. Retrieval proceeds via multi-source Dijkstra over the learned geodesic distances, followed by Maximal Marginal Relevance reranking and path coherence filtering. On citation prediction benchmarks with 169K arXiv papers, GSS achieves a 23\% relative improvement in Recall@20 over SPECTER+FAISS baselines. We provide a Bridge Recovery Guarantee characterizing when geodesic retrieval qualitatively outperforms direct similarity, a margin separation result connecting training loss to retrieval quality, and a characterization of the expressiveness of the low-rank metric parameterization. Our hierarchical coarse-to-fine search with k-means pooling reduces computational cost by $4\times$ while retaining 97\% of retrieval quality.
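The two core ingredients named above, the low-rank metric $\mG_i = \mL_i \mL_i^\top + \eps \mI$ and multi-source Dijkstra over the resulting edge lengths, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the dictionary-based adjacency format, the default value of `eps`, and the choice to measure edge $(i, j)$ with the metric of the source node $i$ only are all assumptions made for illustration.

```python
import heapq
import numpy as np


def local_metric(L, eps=1e-3):
    """Dense metric G = L L^T + eps * I for a (d x r) factor L (hypothetical helper)."""
    d = L.shape[0]
    return L @ L.T + eps * np.eye(d)


def edge_length(x_i, x_j, L_i, eps=1e-3):
    """Length of edge (i, j) as sqrt(v^T G_i v) with v = x_j - x_i.

    Assumption: only the source node's metric is used. The d x d matrix G_i
    is never formed; v^T G_i v = ||L_i^T v||^2 + eps * ||v||^2.
    """
    v = x_j - x_i
    proj = L_i.T @ v
    return float(np.sqrt(proj @ proj + eps * (v @ v)))


def multi_source_dijkstra(adj, sources):
    """Shortest distance from the nearest source to every reachable node.

    adj: dict mapping node -> list of (neighbor, non-negative length).
    sources: dict mapping seed node -> initial distance (e.g. 0.0).
    """
    dist = dict(sources)
    heap = [(d0, s) for s, d0 in sources.items()]
    heapq.heapify(heap)
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d_u + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

In a retrieval setting, the seed nodes would plausibly be the nearest neighbors of the query embedding, and the returned distances would be passed on to reranking. That wiring is likewise an assumption and is not specified in the abstract.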