Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet its effectiveness hinges on document chunking, an often-overlooked determinant of RAG quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods on a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than the semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG for visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.
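For readers unfamiliar with the simplest of the four strategies, the fixed-size sliding window can be illustrated with a minimal sketch. This is not the implementation evaluated in the study: the function name `sliding_window_chunks`, the whitespace tokenisation, and the `size` and `overlap` values are all assumptions chosen for illustration.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into fixed-size windows that share `overlap` tokens.

    Hypothetical helper for illustration only; the parameters are assumed,
    not taken from the study.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input,
        # so no redundant trailing fragment is emitted.
        if start + size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Naive whitespace tokenisation, assumed here purely for the demo.
    text = "the pump discharge valve must be closed before the line is isolated"
    for chunk in sliding_window_chunks(text.split(), size=5, overlap=2):
        print(" ".join(chunk))
```

Because the window ignores headings, tables, and sentence boundaries, a chunk can start or end mid-clause, which is the behaviour that the recursive, semantic, and structure-aware strategies each try to avoid in different ways.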