Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies (e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult. This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically informed and LLM-guided methods; and (2) embedding paradigms, which determine the timing of chunking relative to embedding (pre-embedding chunking vs. contextualized chunking). Our reproduction evaluates these approaches in two distinct retrieval settings established in previous work: in-document retrieval (needle-in-a-haystack) and in-corpus retrieval (the standard information retrieval task). Our comprehensive evaluation reveals that optimal chunking strategies are task-dependent: simple structure-based methods outperform LLM-guided alternatives for in-corpus retrieval, while LumberChunker performs best for in-document retrieval. Contextualized chunking improves in-corpus effectiveness but degrades in-document retrieval. We also find that chunk size correlates moderately with in-document retrieval effectiveness but only weakly with in-corpus effectiveness, suggesting that differences between segmentation methods are not purely driven by chunk size. Our code and evaluation benchmarks are publicly available at (anonymized).
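The two embedding paradigms above differ only in when pooling happens relative to encoding: pre-embedding chunking encodes each chunk in isolation, while contextualized (late) chunking encodes the full document once and then pools token vectors per chunk. A minimal sketch, using a hypothetical hash-based token embedder in place of a real transformer encoder (purely for illustration; the actual systems evaluated in the paper use learned contextual encoders):

```python
import numpy as np

def embed_tokens(tokens, dim=8):
    """Toy stand-in for an encoder: one deterministic vector per token.
    NOTE: this toy embedder is context-free, so it cannot show the contextual
    benefit of late chunking; with a real transformer, each token vector in
    late chunking would carry full-document context."""
    vecs = []
    for tok in tokens:
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        vecs.append(rng.standard_normal(dim))
    return np.stack(vecs)

def pre_embedding_chunking(tokens, boundaries):
    """Segment first, then embed each chunk independently (mean pooling)."""
    return [embed_tokens(tokens[s:e]).mean(axis=0) for s, e in boundaries]

def late_chunking(tokens, boundaries):
    """Embed the whole document once, then pool token vectors per chunk."""
    token_vecs = embed_tokens(tokens)
    return [token_vecs[s:e].mean(axis=0) for s, e in boundaries]

tokens = "document chunking is a critical preprocessing step".split()
boundaries = [(0, 3), (3, 7)]  # two chunks over the token sequence
pre = pre_embedding_chunking(tokens, boundaries)
late = late_chunking(tokens, boundaries)
```

With this context-free toy embedder the two paradigms coincide; the point of the sketch is the control flow — in late chunking the encoder sees the whole document before any chunk boundary is applied.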