We present Geodesic Semantic Search (GSS), a retrieval system that learns node-specific Riemannian metrics on citation graphs to enable geometry-aware semantic search. Unlike standard embedding-based retrieval that relies on a fixed Euclidean distance, \gss{} learns a low-rank factor $\mL_i \in \R^{d \times r}$ at each node, inducing a local positive definite metric $\mG_i = \mL_i \mL_i^\top + \eps \mI$. This parameterization guarantees valid metrics while keeping the model tractable. Retrieval proceeds via multi-source Dijkstra on the learned geodesic distances, followed by Maximal Marginal Relevance reranking and path coherence filtering. On citation prediction benchmarks with 169K papers, \gss{} achieves a 23\% relative improvement in Recall@20 over SPECTER+FAISS baselines while providing interpretable citation paths. Our hierarchical coarse-to-fine search with k-means pooling reduces computational cost by 4$\times$ compared to flat geodesic search while retaining 97\% of retrieval quality. We provide theoretical analysis of when geodesic distances outperform direct similarity, characterize the approximation quality of low-rank metrics, and validate these predictions empirically. Code and trained models are available at https://github.com/YCRG-Labs/geodesic-search.
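To make the parameterization concrete, here is a minimal NumPy sketch of the low-rank metric construction $\mG_i = \mL_i \mL_i^\top + \eps \mI$ and a local Mahalanobis-style edge length under a node's metric. The function names and the choice of measuring an edge under the source node's metric are illustrative assumptions, not the authors' API:

```python
import numpy as np

def local_metric(L, eps=1e-3):
    """Assemble G_i = L_i L_i^T + eps * I from a d x r factor L_i.

    The eps * I term makes G_i positive definite for any L_i,
    so every node's metric is valid by construction.
    """
    d = L.shape[0]
    return L @ L.T + eps * np.eye(d)

def local_distance(x, y, L, eps=1e-3):
    """Length of the displacement (x - y) measured under node i's metric.

    This is the Mahalanobis form sqrt((x - y)^T G_i (x - y)); summing
    such edge lengths along a path gives the graph geodesic that
    Dijkstra minimizes.
    """
    diff = x - y
    G = local_metric(L, eps)
    return float(np.sqrt(diff @ G @ diff))

# Toy example: small embedding dimension d with rank-r factor, r << d.
rng = np.random.default_rng(0)
d, r = 8, 2
L = rng.normal(size=(d, r))
x, y = rng.normal(size=d), rng.normal(size=d)
print(local_distance(x, y, L))
```

Because $\mG_i \succeq \eps \mI$, every edge length is bounded below by $\sqrt{\eps}$ times the Euclidean displacement, which keeps geodesic distances well defined even when $\mL_i$ is rank deficient.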