VersionRAG: Version-Aware Retrieval-Augmented Generation for Evolving Documents

Retrieval-Augmented Generation (RAG) systems fail when documents evolve through versioning, a ubiquitous characteristic of technical documentation. Existing approaches achieve only 58-64% accuracy on version-sensitive questions because they retrieve semantically similar content without checking its temporal validity. We present VersionRAG, a version-aware RAG framework that explicitly models document evolution through a hierarchical graph structure capturing version sequences, content boundaries, and changes between document states. During retrieval, VersionRAG routes queries through specialized paths based on intent classification, enabling precise version-aware filtering and change tracking. On our VersionQA benchmark, which comprises 100 manually curated questions across 34 versioned technical documents, VersionRAG achieves 90% accuracy, outperforming naive RAG (58%) and GraphRAG (64%). On implicit change detection, where baselines fail (0-10%), VersionRAG reaches 60% accuracy, demonstrating its ability to track undocumented modifications. Additionally, VersionRAG requires 97% fewer tokens during indexing than GraphRAG, making it practical for large-scale deployment. Our work establishes versioned document QA as a distinct task and provides both a solution and a benchmark for future research.
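To make the routing idea concrete, the following is a minimal sketch of intent-based routing with version-aware filtering. It is not the authors' implementation: the `Chunk` and `Change` records, the keyword-based `classify_intent`, the token-overlap scoring, and the three intent labels are all assumptions chosen for illustration, standing in for whatever classifier, graph structure, and retriever the framework actually uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical content node: a passage tied to one document version."""
    doc: str
    version: str
    text: str


@dataclass(frozen=True)
class Change:
    """Hypothetical change node linking two consecutive versions."""
    doc: str
    from_version: str
    to_version: str
    summary: str


# Assumed heuristics; a real system would likely use a learned classifier.
VERSION_RE = re.compile(r"\b\d+(?:\.\d+)+\b")
CHANGE_WORDS = ("change", "added", "removed", "differ", "new in")


def classify_intent(query: str) -> str:
    """Label a query as 'change', 'version_specific', or 'general'."""
    q = query.lower()
    if any(word in q for word in CHANGE_WORDS):
        return "change"
    if VERSION_RE.search(q):
        return "version_specific"
    return "general"


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase tokens."""
    def tokens(s: str) -> set:
        return set(re.findall(r"[a-z0-9_.]+", s.lower()))
    return len(tokens(query) & tokens(text))


def retrieve(query: str, chunks: list, changes: list):
    """Route the query by intent, then filter by any version it mentions."""
    intent = classify_intent(query)
    versions = VERSION_RE.findall(query)
    if intent == "change":
        # Change path: answer from change records, not from content chunks.
        hits = [c for c in changes
                if not versions
                or c.from_version in versions
                or c.to_version in versions]
        return intent, [c.summary for c in hits]
    candidates = chunks
    if intent == "version_specific":
        # Version path: discard passages from other versions before ranking.
        candidates = [c for c in chunks if c.version in versions]
    ranked = sorted(candidates, key=lambda c: overlap(query, c.text), reverse=True)
    return intent, [c.text for c in ranked[:1]]


if __name__ == "__main__":
    chunks = [
        Chunk("lib", "1.0", "default timeout is 30 seconds"),
        Chunk("lib", "2.0", "default timeout is 60 seconds"),
    ]
    changes = [Change("lib", "1.0", "2.0", "default timeout raised from 30 to 60 seconds")]
    print(retrieve("What is the default timeout in 2.0?", chunks, changes))
    print(retrieve("What changed in 2.0?", chunks, changes))
```

The point of the sketch is the ordering: the version filter is applied before similarity ranking, so a near-identical passage from another version cannot outrank the valid one, which is the failure mode the abstract attributes to retrieval without temporal validity checks.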
