Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI output in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content by how closely it resembles the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence beneath lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the number of verified atomic claims relative to total token count. In the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis, and content is filtered on that score before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson r = -0.8636, p = 2.27e-07). Z-score normalization within length bins resolved this bias: the association with document length was no longer significant at the 0.05 level (p = 0.0749), supporting FD* as a length-independent density signal. Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in the top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. Full statistical validation across n=50 queries remains future work because of constraints on corpus-benchmark alignment; even so, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.

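The length correction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact FD* formula, bin boundaries, or the way FD* is combined with cosine similarity, so everything here is an assumption for illustration: raw density is taken as verified claims divided by tokens, `bin_edges` holds hypothetical token thresholds, the population standard deviation is used within each bin, and the hypothetical `rerank` helper blends the two signals linearly with an arbitrary `weight`.

```python
from collections import defaultdict
from statistics import mean, pstdev


def raw_density(verified_claims: int, token_count: int) -> float:
    """Raw factual density: verified atomic claims per token (assumed form)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return verified_claims / token_count


def length_binned_zscores(docs, bin_edges):
    """Z-score raw density within each document-length bin.

    docs: list of (doc_id, verified_claims, token_count) tuples.
    bin_edges: ascending token-count thresholds (hypothetical values).
    Returns {doc_id: z-score of raw density within the document's bin}.
    """
    def bin_index(tokens):
        # Count how many thresholds the document length meets or exceeds.
        return sum(tokens >= edge for edge in bin_edges)

    bins = defaultdict(list)
    for doc_id, claims, tokens in docs:
        bins[bin_index(tokens)].append((doc_id, raw_density(claims, tokens)))

    scores = {}
    for members in bins.values():
        values = [density for _, density in members]
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation of the bin
        for doc_id, density in members:
            # A bin with no spread (or one document) carries no ranking signal.
            scores[doc_id] = 0.0 if sigma == 0 else (density - mu) / sigma
    return scores


def rerank(candidates, fd_scores, weight=0.5):
    """Sort (doc_id, cosine_similarity) pairs by an assumed linear blend."""
    return sorted(
        candidates,
        key=lambda c: c[1] + weight * fd_scores.get(c[0], 0.0),
        reverse=True,
    )
```

Because each document is compared only with others of similar length, a short document no longer outranks a long one simply because its token count is smaller. With `rerank`, a passage that cosine similarity places second can move to the top when its normalized density is high enough, which is the kind of reordering the abstract reports for the Cochrane evidence.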