Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity of LLMs to hallucinate, that is, to generate plausible but factually incorrect assertions, presents a critical barrier to adoption in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores how effectively Retrieval-Augmented Generation (RAG) architectures mitigate these risks by grounding generative outputs in authoritative document context. Specifically, the study compares a baseline Vanilla LLM against a Basic RAG pipeline and an Advanced RAG pipeline that adds cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, assessed through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while the Basic RAG architecture substantially improves faithfulness (0.621) over the Vanilla baseline (0.347), the Advanced RAG configuration achieves the highest average faithfulness, 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
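The two-stage retrieval idea described above (a fast first-stage retriever that shortlists chunks, followed by a more careful re-ranker that scores each query-chunk pair jointly) can be illustrated with a minimal sketch. This is not the study's implementation: the bag-of-words cosine below is only a stand-in for an embedding model such as all-MiniLM-L6-v2, the bigram-overlap scorer is only a stand-in for a cross-encoder, and every function name and example chunk is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_stage(query, chunks, k):
    """Stage 1 (stand-in for a bi-encoder): score query and chunks
    independently, then return the indices of the k most similar chunks."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def pair_score(query, chunk):
    """Stage 2 scorer (toy stand-in for a cross-encoder): looks at query and
    chunk together and rewards query bigrams that appear in order in the chunk."""
    q = tokenize(query)
    c = tokenize(chunk)
    q_bigrams = list(zip(q, q[1:]))
    if not q_bigrams:
        return 0.0
    c_bigrams = set(zip(c, c[1:]))
    return sum(bg in c_bigrams for bg in q_bigrams) / len(q_bigrams)


def retrieve_and_rerank(query, chunks, k_candidates=3, k_final=1):
    """Shortlist k_candidates chunks cheaply, then re-order only that
    shortlist with the pairwise scorer and keep the top k_final."""
    candidates = first_stage(query, chunks, k_candidates)
    reranked = sorted(candidates, key=lambda i: (-pair_score(query, chunks[i]), i))
    return [chunks[i] for i in reranked[:k_final]]


# Hypothetical example data, not taken from the CDC corpus.
QUERY = "policy analysis framework"
CHUNKS = [
    "Keywords: framework, analysis, policy",
    "The policy analysis framework has several steps.",
    "Vaccination schedules for children.",
]

if __name__ == "__main__":
    print(first_stage(QUERY, CHUNKS, 1))
    print(retrieve_and_rerank(QUERY, CHUNKS, k_candidates=2, k_final=1))
```

In this toy data the first stage alone ranks the keyword list highest, because it shares all three query terms in a short chunk, while the re-ranking step promotes the sentence that contains the query phrase in order. That is the kind of correction a second stage is meant to provide, although a real cross-encoder learns its scoring rather than counting bigrams.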

