PT-RAG: Structure-Fidelity Retrieval-Augmented Generation for Academic Papers

Retrieval-augmented generation (RAG) is increasingly applied to question answering over long academic papers, where accurate evidence allocation under a fixed token budget is critical. Existing approaches typically flatten papers into unstructured chunks during preprocessing, which destroys their native hierarchical structure. This loss forces retrieval to operate in a disordered space, producing fragmented contexts, misallocating tokens to non-evidential regions, and increasing the reasoning burden on downstream language models. To address these issues, we propose PT-RAG, a RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior. PT-RAG first inherits the native hierarchy to construct a structure-fidelity PaperTree index, which prevents entropy increase at the source. It then applies a path-guided retrieval mechanism that aligns query semantics with relevant sections and selects high-relevance root-to-leaf paths under a fixed token budget, yielding compact, coherent, low-entropy retrieval contexts. In contrast to existing RAG approaches, PT-RAG avoids the entropy increase caused by destructive preprocessing and gives subsequent retrieval a native low-entropy structural basis. To assess this design, we introduce entropy-based structural diagnostics that quantify retrieval fragmentation and evidence allocation accuracy. On three academic question-answering benchmarks, PT-RAG achieves consistently lower section entropy and evidence-alignment cross-entropy than strong baselines, indicating reduced context fragmentation and more precise allocation to evidential regions. These structural advantages translate directly into higher answer quality.
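
The abstract gives no formulas or pseudocode, so the following is only a minimal Python sketch of the two ideas it names: selecting root-to-leaf paths under a fixed token budget, and measuring fragmentation with a section-level entropy. Everything in it is an assumption made for illustration rather than the authors' method: the `Node` class, the per-node `score` field, the greedy mean-score ranking in `select_paths`, and the definition of section entropy as the Shannon entropy of retrieved tokens across top-level sections.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a section or subsection of a paper."""
    title: str
    tokens: int = 0      # token count of the text attached to this node
    score: float = 0.0   # assumed query-relevance score (e.g. from an embedder)
    children: list["Node"] = field(default_factory=list)


def root_to_leaf_paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from root_to_leaf_paths(child, path)


def select_paths(root, budget):
    """Greedy sketch: rank paths by mean node score, keep those that fit.

    Nodes shared between paths (common ancestors) are charged only once.
    """
    ranked = sorted(
        root_to_leaf_paths(root),
        key=lambda p: sum(n.score for n in p) / len(p),
        reverse=True,
    )
    chosen, used, seen = [], 0, set()
    for path in ranked:
        cost = sum(n.tokens for n in path if id(n) not in seen)
        if used + cost <= budget:
            chosen.append(path)
            used += cost
            seen.update(id(n) for n in path)
    return chosen, used


def tokens_by_section(paths):
    """Tally retrieved tokens per top-level section (root text is ignored)."""
    tally, seen = {}, set()
    for path in paths:
        section = path[1].title if len(path) > 1 else path[0].title
        for n in path[1:] or path:
            if id(n) not in seen:
                seen.add(id(n))
                tally[section] = tally.get(section, 0) + n.tokens
    return tally


def section_entropy(tally):
    """Shannon entropy (bits) of the token distribution over sections."""
    total = sum(tally.values())
    if total == 0:
        return 0.0
    return sum((t / total) * math.log2(total / t) for t in tally.values() if t > 0)


# Toy paper with made-up token counts and scores.
paper = Node("Paper", children=[
    Node("Methods", tokens=10, score=0.9, children=[
        Node("Path retrieval", tokens=100, score=0.9),
        Node("Index", tokens=120, score=0.4),
    ]),
    Node("Related Work", tokens=10, score=0.1, children=[
        Node("RAG", tokens=150, score=0.2),
    ]),
])

chosen, used = select_paths(paper, budget=250)
allocation = tokens_by_section(chosen)
entropy = section_entropy(allocation)
```

In this toy case both selected paths fall under "Methods", so the allocation is concentrated in one section and the sketch's entropy is zero. A flat chunk retriever that split the same budget evenly between two sections would score one bit under the same definition.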
