Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment-based and question-answering (QA)-based approaches, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on PlainFact, a fine-grained, human-annotated dataset, to evaluate the factual consistency of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies the sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents a sentence-aware, retrieval-augmented metric targeted at elaborative explanations in biomedical PLS tasks, providing the community with both a new benchmark and a practical evaluation tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact
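The two-stage procedure described in the abstract (classify the sentence type, then apply retrieval-augmented QA scoring) can be illustrated with a minimal Python sketch. This is not the PlainQAFact implementation: the function names, the word-overlap retriever, the token-level F1 answer overlap, and the choice to append retrieved passages to the abstract for elaborative sentences are all assumptions made for illustration. The classifier, question generator, and QA model are passed in as hypothetical callables, since the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 between two answer strings (one possible overlap measure)."""
    p, g = predicted.lower().split(), expected.lower().split()
    if not p or not g:
        return float(p == g)  # 1.0 only when both answers are empty
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def retrieve(query: str, passages: List[str], k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank passages by shared lowercase words."""
    q = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda passage: len(q & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def score_sentence(
    sentence: str,
    abstract: str,
    knowledge_base: List[str],
    is_elaboration: Callable[[str], bool],                 # hypothetical classifier
    make_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # hypothetical QG step
    answer_question: Callable[[str, str], str],             # hypothetical QA model
) -> float:
    """Score one summary sentence against its (possibly augmented) context."""
    context = abstract
    if is_elaboration(sentence):
        # Assumption: elaborative sentences are checked against the abstract
        # plus passages retrieved from an external knowledge source.
        context = abstract + " " + " ".join(retrieve(sentence, knowledge_base))
    qa_pairs = make_qa_pairs(sentence)
    if not qa_pairs:
        return 0.0
    # Answer each question from the context and compare with the answer
    # extracted from the summary sentence itself.
    scores = [token_f1(answer_question(q, context), a) for q, a in qa_pairs]
    return sum(scores) / len(scores)
```

In this sketch a sentence that only simplifies the source is scored against the abstract alone, while an elaborative sentence gets additional retrieved context, so that externally sourced but correct content is not penalized merely for being absent from the abstract.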