Summaries of medical text must be faithful, that is, consistent with and factually grounded in their source inputs. Faithfulness is an important but understudied topic for safety and efficiency in healthcare. In this paper, we investigate and improve the faithfulness of summarization across a broad range of medical summarization tasks. Our investigation reveals that current summarization models often produce unfaithful outputs for medical input text. We then introduce FaMeSumm, a framework that improves faithfulness by fine-tuning pre-trained language models based on medical knowledge. FaMeSumm performs contrastive learning on designed sets of faithful and unfaithful summaries, and it incorporates medical terms and their contexts to encourage faithful generation of medical terms. We conduct comprehensive experiments on three datasets in two languages: health question and radiology report summarization datasets in English, and a patient-doctor dialogue dataset in Chinese. Results demonstrate that FaMeSumm is flexible and effective, delivering consistent improvements over mainstream language models such as BART, T5, mT5, and PEGASUS and yielding state-of-the-art performance on metrics for faithfulness and general quality. Human evaluation by doctors also shows that FaMeSumm generates more faithful outputs. Our code is available at https://github.com/psunlpgroup/FaMeSumm.
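To make the idea of contrastive learning over faithful and unfaithful summaries concrete, the following is a minimal sketch of a generic InfoNCE-style objective. It is not taken from the FaMeSumm code and may differ from the loss the framework actually uses. The function name `contrastive_loss`, the `temperature` parameter, and the assumption that each candidate summary has already been reduced to a scalar similarity score against the source are all hypothetical choices made for illustration.

```python
import math


def contrastive_loss(pos_scores, neg_scores, temperature=1.0):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact objective).

    pos_scores: similarity scores of faithful summaries to the source.
    neg_scores: similarity scores of unfaithful summaries to the source.
    Returns the mean, over faithful summaries, of
        -log( exp(p/t) / (exp(p/t) + sum_n exp(n/t)) ).
    """
    if not pos_scores:
        raise ValueError("at least one faithful (positive) score is required")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    neg_logits = [s / temperature for s in neg_scores]
    total = 0.0
    for p in pos_scores:
        pos_logit = p / temperature
        logits = [pos_logit] + neg_logits
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - pos_logit
    return total / len(pos_scores)


if __name__ == "__main__":
    # Hypothetical scores: faithful summaries score higher than unfaithful ones.
    print(contrastive_loss([2.0, 1.5], [0.1, -0.3]))
```

Under this sketch, the loss decreases as faithful summaries score higher than unfaithful ones, which is the behavior a contrastive objective is meant to encourage. How the scores are computed (for example, from encoder or decoder representations) is left open here.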