Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

From arXiv: 16 pages, 8 tables. Includes Experiment 3 results (n=11, Wilcoxon p=0.0619). Preliminary findings; a powered Experiment 3 and a Graph RAG extension are identified as future work. Updated from v1.

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content by how closely it resembles the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence beneath lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal defined as the ratio of verified atomic claims to total token count. In the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis, and content is filtered before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Applying Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in the top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. Full statistical validation across n=50 queries remains future work because of constraints on corpus-benchmark alignment; within that limit, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.
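
The density signal and its length correction described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`raw_density`, `length_bin`, `fd_star`), the bin edges, and the use of the population standard deviation are all assumptions made for illustration, and the count of verified atomic claims per document is taken as a given input rather than computed.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def raw_density(verified_claims: int, token_count: int) -> float:
    """Verified atomic claims per token (the raw, length-confounded ratio)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return verified_claims / token_count


def length_bin(token_count: int, edges: Sequence[int]) -> int:
    """Index of the first bin whose upper edge is >= token_count."""
    for i, edge in enumerate(edges):
        if token_count <= edge:
            return i
    return len(edges)  # overflow bin for documents longer than every edge


def fd_star(
    docs: Sequence[Tuple[int, int]],
    edges: Sequence[int] = (200, 500, 1000),  # hypothetical bin edges
) -> List[float]:
    """Z-score each document's raw density within its own length bin.

    docs holds (verified_claims, token_count) pairs. A bin with zero
    variance (for example a single document) gets a score of 0.0.
    """
    raw = [raw_density(claims, tokens) for claims, tokens in docs]
    bins = [length_bin(tokens, edges) for _, tokens in docs]
    scores = [0.0] * len(docs)
    for b in set(bins):
        idx = [i for i, x in enumerate(bins) if x == b]
        values = [raw[i] for i in idx]
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        for i in idx:
            scores[i] = (raw[i] - mu) / sigma if sigma > 0 else 0.0
    return scores


if __name__ == "__main__":
    # Two short and two long documents; each is compared only with its length peers.
    sample = [(10, 100), (5, 100), (20, 800), (10, 800)]
    print(fd_star(sample))
```

Because each score is computed relative to documents of similar length, a long document is no longer penalized simply for having more tokens in the denominator, which is the intuition behind the length-bin normalization.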
