Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. Natural language exhibits hierarchical structure, from broad topics down to specific entities, that Euclidean embeddings fail to preserve; as a result, semantically distant documents can appear spuriously similar, which increases hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval and develop two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture that projects pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines. On RAGBench, HyTE-H achieves gains of up to 29% over Euclidean baselines in context relevance and answer relevance while using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with a radial increase of more than 20% from general to specific concepts. This property is absent in Euclidean embeddings and underscores the critical role of geometric inductive bias in faithful RAG systems.
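To make the geometric setting concrete, the following is a minimal sketch of the Lorentz model, not the HyTE implementation. The abstract does not specify how HyTE-H projects Euclidean embeddings, so the exponential map at the origin used here is an assumption (it is one common choice), and the function names, the `curvature` parameter, and the toy vectors are hypothetical. The sketch illustrates why a point's distance from the origin can act as a radial signal: under this map it equals the Euclidean norm of the input vector.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(v, curvature=1.0):
    """Lift a Euclidean vector v (treated as a tangent vector at the origin)
    onto the hyperboloid {x : <x, x>_L = -1/curvature, x0 > 0}.

    Assumption: this is a standard construction, not necessarily the
    projection used by HyTE-H.
    """
    sqrt_c = math.sqrt(curvature)
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-12:
        # The zero vector maps to the hyperboloid origin.
        return [1.0 / sqrt_c] + [0.0] * len(v)
    time = math.cosh(sqrt_c * norm) / sqrt_c
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    return [time] + [scale * a for a in v]


def lorentz_distance(x, y, curvature=1.0):
    """Geodesic distance: arcosh(-c * <x, y>_L) / sqrt(c)."""
    # Clamp to 1.0 to guard against rounding just below the arcosh domain.
    arg = max(1.0, -curvature * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(curvature)


# Toy vectors standing in for a "general" and a "specific" document embedding.
origin = exp_map_origin([0.0, 0.0])
general = exp_map_origin([0.3, 0.1])
specific = exp_map_origin([1.2, 0.4])

radius_general = lorentz_distance(origin, general)
radius_specific = lorentz_distance(origin, specific)
print("radius (general): ", radius_general)
print("radius (specific):", radius_specific)
print("pairwise distance:", lorentz_distance(general, specific))
```

Because the toy vectors are made up, the printed radii only show the mechanics of a norm-based separation; they say nothing about the magnitude reported in the paper.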