Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are designed primarily for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation in which each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves the semantic relationships between fields within a row while improving token utilization and reducing fragmentation. In evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value-based baselines, respectively, while also improving processing efficiency. In retrieval benchmarks, STC improves MRR from 0.3576 to 0.5945 in a hybrid setting and increases Recall@1 from 0.366 to 0.754 in BM25-only retrieval. These results demonstrate that preserving structure during chunking improves retrieval performance, highlighting the importance of structure-aware chunking for RAG over tabular data.
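The pipeline described above (key-value row encoding, token-constrained splitting at structural boundaries, overlap-free greedy merging) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`count_tokens`, `row_to_kv_block`, `pack`, `split_block`, `chunk_rows`) are hypothetical, whitespace word counts stand in for a real tokenizer, field boundaries are assumed to be the only structural split points, and the Row Tree is flattened to a list of rows. Field values are assumed not to contain newlines.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def row_to_kv_block(row: Dict[str, str]) -> str:
    # Encode one row as a key-value block, one "key: value" field per line.
    return "\n".join(f"{key}: {value}" for key, value in row.items())


def pack(units: List[str], max_tokens: int, sep: str) -> List[str]:
    # Greedily pack consecutive units into groups of at most max_tokens.
    # Each unit lands in exactly one group, so groups never overlap.
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for unit in units:
        n = count_tokens(unit)
        if current and used + n > max_tokens:
            chunks.append(sep.join(current))
            current, used = [], 0
        current.append(unit)
        used += n
    if current:
        chunks.append(sep.join(current))
    return chunks


def split_block(block: str, max_tokens: int) -> List[str]:
    # A row that fits the budget stays whole, keeping its fields together.
    if count_tokens(block) <= max_tokens:
        return [block]
    # Otherwise split at field boundaries (the structural boundary assumed here).
    fields: List[str] = []
    for field in block.split("\n"):
        words = field.split()
        if len(words) <= max_tokens:
            fields.append(field)
        else:
            # Fallback for a single oversized field: fixed-size word windows.
            fields.extend(
                " ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)
            )
    return pack(fields, max_tokens, "\n")


def chunk_rows(rows: List[Dict[str, str]], max_tokens: int) -> List[str]:
    # Split oversized rows first, then merge neighbouring pieces densely.
    pieces: List[str] = []
    for row in rows:
        pieces.extend(split_block(row_to_kv_block(row), max_tokens))
    return pack(pieces, max_tokens, "\n\n")
```

Rows could come from any source that yields dictionaries, for example Python's `csv.DictReader`. Merging whole row blocks before closing a chunk is what keeps chunks dense without repeating content across chunk boundaries.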