Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring atomic claims and filtering them with a threshold calibrated on held-out data; however, it does not guarantee that the final output is informative. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) the conformal factuality guarantee is not robust to distribution shifts and distractors, a limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and the fragility of the conformal filtering framework under distribution shifts and distractors, highlight the need for new reliability approaches that treat robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
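
To make the filtering step concrete, the following is a minimal Python sketch of claim filtering with a threshold calibrated on held-out data. It assumes the standard split-conformal recipe and is not necessarily the exact procedure evaluated in this work. The function names `calibrate_threshold` and `filter_claims`, the `(score, is_correct)` data layout, and the toy scores are all hypothetical, chosen only for illustration.

```python
import math


def calibrate_threshold(calibration, alpha):
    """Pick a score threshold from held-out, labeled examples.

    calibration: list of examples; each example is a list of
        (score, is_correct) pairs, one pair per atomic claim.
    alpha: target error rate (e.g. 0.1 for a 90% factuality level).
    """
    # Per-example nonconformity: the highest score assigned to any incorrect
    # claim, or -inf when every claim in the example is correct. Keeping only
    # claims scored strictly above this value leaves no incorrect claim.
    nonconformity = sorted(
        max((score for score, ok in claims if not ok), default=-math.inf)
        for claims in calibration
    )
    n = len(nonconformity)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample conformal rank
    if rank > n:
        # Too few calibration examples for this alpha: retain nothing.
        return math.inf
    return nonconformity[rank - 1]


def filter_claims(scored_claims, threshold):
    """Keep only the claims whose score is strictly above the threshold."""
    return [claim for claim, score in scored_claims if score > threshold]


if __name__ == "__main__":
    # Toy calibration set (made-up scores): each example has one incorrect
    # claim and one correct claim.
    toy_calibration = [
        [(v, False), (0.95, True)] for v in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
    ]
    tau = calibrate_threshold(toy_calibration, alpha=0.25)
    kept = filter_claims([("claim A", 0.9), ("claim B", 0.6)], tau)
    print(tau, kept)
```

In this sketch a stricter target (smaller `alpha`) pushes the threshold up, so fewer claims survive, and an output can end up empty. That is the vacuous-output failure mode referred to in finding (i). The guarantee also relies on calibration and deployment examples being exchangeable, which is the assumption that distribution shifts and distractors break in finding (ii).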
