In enterprise datasets, documents are rarely pure. They are neither just text nor just numbers; they are a complex amalgam of narrative and structure. Current Retrieval-Augmented Generation (RAG) systems address this complexity with a blunt tool: linearization. Rich, multidimensional tables are flattened into Markdown-style text strings in the hope that an embedding model will capture the geometry of a spreadsheet in a single vector. Prior work has already shown that this is mathematically insufficient. This work presents Topo-RAG, a framework that challenges the assumption that "everything is text". We propose a dual architecture that respects the topology of the data: fluid narrative is routed through traditional dense retrievers, while tabular structures are processed by a Cell-Aware Late Interaction mechanism that preserves their spatial relationships. Evaluated on SEC-25, a synthetic enterprise corpus that mimics real-world complexity, Topo-RAG achieves an 18.4% improvement in nDCG@10 on hybrid queries over standard linearization approaches. The goal is not only to search better, but to understand the shape of information.
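To make the dual-path idea concrete, the following is a minimal sketch, not the Topo-RAG implementation. It assumes a generic ColBERT-style MaxSim score as a stand-in for the Cell-Aware Late Interaction mechanism, whose details are not given here. The chunk layout (`kind`, `vec`, `cell_vecs`) and all function names are hypothetical, and the embeddings are assumed to come from some external encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def dense_score(query_vec, doc_vec):
    """Narrative path: one vector per query and one per passage."""
    return cosine(query_vec, doc_vec)

def late_interaction_score(query_token_vecs, cell_vecs):
    """Table path (assumed MaxSim): each query token keeps its best-matching
    cell, and the per-token maxima are summed."""
    if not cell_vecs:
        return 0.0
    return sum(max(cosine(q, c) for c in cell_vecs) for q in query_token_vecs)

def score_chunk(chunk, query_vec, query_token_vecs):
    """Route a chunk to the scorer that matches its (hypothetical) kind."""
    if chunk["kind"] == "table":
        return late_interaction_score(query_token_vecs, chunk["cell_vecs"])
    return dense_score(query_vec, chunk["vec"])

# Toy 2-d embeddings, chosen by hand for illustration only.
table_chunk = {"kind": "table", "cell_vecs": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]}
text_chunk = {"kind": "text", "vec": [1.0, 0.0]}
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
query_vec = [1.0, 0.0]

table_score = score_chunk(table_chunk, query_vec, query_tokens)
text_score = score_chunk(text_chunk, query_vec, query_tokens)
```

The two paths return scores on different scales: the MaxSim sum grows with the number of query tokens, while the dense score is bounded by 1. Any merged ranking would need a normalization or fusion step, which this sketch deliberately leaves out.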