Retrieval-Augmented Generation (RAG) systems are increasingly deployed on large-scale document collections, often comprising millions of documents and tens of millions of text chunks. In industrial-scale retrieval platforms, scalability is typically addressed through horizontal sharding combined with Approximate Nearest-Neighbor search, hybrid indexing, and optimized metadata filtering. Although efficient, these mechanisms rely on bottom-up, similarity-driven organization and lack a conceptual rationale for corpus partitioning. In this paper, we argue that the design of large-scale RAG systems can benefit from combining two orthogonal strategies: semantic clustering, which optimizes locality in embedding space, and multidimensional partitioning, which governs where retrieval should occur based on conceptual dimensions such as time and organizational context. Such dimensions are already implicitly present in current systems, but they are used in an ad hoc and poorly structured manner. We propose the Dimensional Fact Model (DFM) as a conceptual framework to guide the design of multidimensional partitions for RAG corpora. The DFM provides a principled way to reason about facts, dimensions, hierarchies, and granularity in retrieval-oriented settings. The framework naturally supports hierarchical routing and controlled fallback strategies, so that retrieval remains robust even when metadata is incomplete, and it turns the search process from black-box similarity matching into a governable and deterministic workflow. This work is intended as a position paper; its goal is to bridge the gap between OLAP-style multidimensional modeling and modern RAG architectures, and to stimulate further research on principled, explainable, and governable retrieval strategies at scale.
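The idea of hierarchical routing with controlled fallback can be illustrated with a minimal sketch. It is not the paper's implementation: the two dimensions (time and organizational unit), their roll-up hierarchies, the fallback order (widen the organizational scope first, then the time scope), and all names such as `routing_plan` and `retrieve` are assumptions made for illustration. A plain substring test stands in for the similarity search that would run inside each selected partition.

```python
from typing import Optional

ALL = "ALL"  # top of every hierarchy: no restriction on that dimension

# Hypothetical roll-up hierarchies (child -> parent) for two dimensions.
TIME_PARENT = {"2024-01": "2024-Q1", "2024-02": "2024-Q1", "2024-Q1": "2024"}
ORG_PARENT = {"payments": "engineering", "search": "engineering"}

Key = tuple[str, str]  # (time member, org member)


def roll_up(member: Optional[str], parent: dict[str, str]) -> list[str]:
    """Chain from `member` up to ALL; missing metadata yields just [ALL]."""
    chain: list[str] = []
    current = member
    while current is not None and current != ALL:
        chain.append(current)
        current = parent.get(current)
    chain.append(ALL)
    return chain


def routing_plan(time: Optional[str], org: Optional[str]) -> list[Key]:
    """Partition keys to try, finest first (assumed policy: widen org, then time)."""
    time_chain = roll_up(time, TIME_PARENT)
    org_chain = roll_up(org, ORG_PARENT)
    plan = [(time_chain[0], o) for o in org_chain]
    plan += [(t, ALL) for t in time_chain[1:]]
    return plan


def covers(key: Key, leaf: Key) -> bool:
    """True if the leaf partition rolls up to `key` on both dimensions."""
    return (key[0] in roll_up(leaf[0], TIME_PARENT)
            and key[1] in roll_up(leaf[1], ORG_PARENT))


def retrieve(term: str, time: Optional[str], org: Optional[str],
             corpus: dict[Key, list[str]]) -> tuple[Optional[Key], list[str]]:
    """Return (key used, hits); the substring test is a stand-in for vector search."""
    for key in routing_plan(time, org):
        hits = [chunk
                for leaf, chunks in corpus.items() if covers(key, leaf)
                for chunk in chunks if term in chunk.lower()]
        if hits:
            return key, hits  # the key records which fallback level answered
    return None, []


# Toy corpus: leaf partitions keyed by (month, team).
corpus = {
    ("2024-01", "payments"): ["Refund policy updated for card disputes."],
    ("2024-02", "search"): ["Index rebuild postmortem: latency regression."],
}
```

Because the plan is an explicit, ordered list of partition keys, the level at which a query was answered can be logged and audited. A query with no time metadata, for example `retrieve("refund", None, "payments", corpus)`, simply starts at the `ALL` level of the time dimension instead of failing.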