Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

arXiv preprint, 9 pages. v2 adds Anthropic Claude and 3 additional fine-tuned bases (1B-7B); 6 frontier families x 3 languages. Code: https://github.com/vectrayx/precision-is-not-faithfulness Demo: https://huggingface.co/spaces/jsantillana/faithful-strategy-engineer-f1

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 157 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). Fine-tuning small models (1B-7B) on the complete oracle closes the precision-recall gap entirely (F1 ~0.98), beating every zero-shot frontier system regardless of scale. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
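To make the precision-versus-coverage argument concrete, the following is a minimal sketch, not the released metric. It assumes that each response has already been reduced to a set of atomic fact identifiers, that matching against the oracle is exact set membership, and that the combined score is the standard harmonic-mean F1. The fact names and the `coverage_aware_scores` helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float  # share of stated claims that the oracle supports
    recall: float     # share of oracle facts that the response covers
    f1: float         # harmonic mean of precision and recall


def coverage_aware_scores(stated: set[str], oracle: set[str]) -> Scores:
    """Score one response against a complete oracle (illustrative only)."""
    supported = stated & oracle
    # Assumption: an empty response is vacuously precise, which is exactly
    # the abstention loophole that a precision-only metric leaves open.
    precision = len(supported) / len(stated) if stated else 1.0
    recall = len(supported) / len(oracle) if oracle else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical oracle for one decision instance: every fact that mattered.
oracle = {"tyre_deg_high", "undercut_threat", "safety_car_window", "rain_in_10_laps"}

# A terse response states one correct fact and nothing else.
terse = coverage_aware_scores({"tyre_deg_high"}, oracle)

# A thorough response covers three oracle facts and adds one unsupported claim.
thorough = coverage_aware_scores(
    {"tyre_deg_high", "undercut_threat", "safety_car_window", "fuel_critical"},
    oracle,
)

print(terse)
print(thorough)
```

In this toy case the terse response wins on precision (1.0 against 0.75) yet loses on F1 (0.4 against 0.75), which mirrors the reordering the abstract describes once coverage is required. Recall is computable here only because the oracle set is assumed to be complete.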
