RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
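The restructuring stage (3) can be illustrated with a minimal sketch. This is not the MDKeyChunker implementation: the `Chunk` class, the `merge_by_key` function, the character-based size budget `max_chars`, and the first-fit packing heuristic are all assumptions made for illustration, since the abstract specifies only that chunks sharing a semantic key are merged via bin-packing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    key: str   # semantic key from the enrichment stage (hypothetical field name)
    text: str  # chunk content


def merge_by_key(chunks: list[Chunk], max_chars: int = 2000) -> list[Chunk]:
    """Group chunks by semantic key, then first-fit pack each group into bins.

    Assumption: bin capacity is measured in characters of chunk text; the
    separators added when joining are not counted against the budget.
    """
    # Group chunks by key; dict insertion order keeps keys in order of first appearance.
    groups: dict[str, list[Chunk]] = defaultdict(list)
    for chunk in chunks:
        groups[chunk.key].append(chunk)

    merged: list[Chunk] = []
    for key, members in groups.items():
        bins: list[list[str]] = []  # texts placed in each bin
        sizes: list[int] = []       # running character count per bin
        for chunk in members:
            size = len(chunk.text)
            # First fit: place the chunk in the first bin with enough room.
            for i in range(len(bins)):
                if sizes[i] + size <= max_chars:
                    bins[i].append(chunk.text)
                    sizes[i] += size
                    break
            else:
                # No bin fits (or the chunk alone exceeds the budget): open a new bin.
                bins.append([chunk.text])
                sizes.append(size)
        merged.extend(Chunk(key, "\n\n".join(parts)) for parts in bins)
    return merged
```

Under these assumptions, chunks that share a key end up adjacent in the output even when they were far apart in the source document, and a chunk larger than the budget is kept whole in its own bin rather than split.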