The goal of automated summarization techniques (Paice, 1990; Kupiec et al., 1995) is to condense text by focusing on the most critical information. Generative large language models (LLMs) have been shown to be robust summarizers, yet traditional metrics struggle to capture the performance of more powerful LLMs (Goyal et al., 2022). In safety-critical domains such as medicine, more rigorous evaluation is required, especially given the potential for LLMs to omit important information from the resulting summary. We propose MED-OMIT, a new omission benchmark for medical summarization. Given a doctor-patient conversation and a generated summary, MED-OMIT breaks the conversation down into a set of facts and identifies which of them are omitted from the summary. We further propose to determine fact importance by simulating the impact of each fact on a downstream clinical task: differential diagnosis (DDx) generation. MED-OMIT leverages LLM prompt-based approaches that categorize the importance of facts and cluster them as supporting or negating evidence for the diagnosis. We evaluate MED-OMIT on a publicly released dataset of patient-doctor conversations and find that MED-OMIT captures omissions better than alternative metrics.
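The omission check described in the abstract can be illustrated with a minimal sketch. It is not the MED-OMIT implementation: the `Fact` class, the `importance` weights, the substring-based `substring_covered` check, and the `weighted_omission_score` aggregate are all hypothetical stand-ins introduced here for illustration. In the approach described above, fact extraction, coverage detection, and importance (via the impact on DDx generation) would come from LLM prompts rather than from hand-set values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fact:
    text: str
    # Hypothetical weight in [0, 1]; assumed here to stand in for the
    # fact's impact on the downstream differential diagnosis.
    importance: float


def substring_covered(fact_text: str, summary: str) -> bool:
    """Toy coverage check (an assumption): case-insensitive substring match."""
    return fact_text.lower() in summary.lower()


def find_omitted(
    facts: List[Fact],
    summary: str,
    is_covered: Callable[[str, str], bool] = substring_covered,
) -> List[Fact]:
    """Return the facts from the conversation that the summary does not cover."""
    return [f for f in facts if not is_covered(f.text, summary)]


def weighted_omission_score(facts: List[Fact], omitted: List[Fact]) -> float:
    """Share of total importance that was omitted (illustrative aggregate only)."""
    total = sum(f.importance for f in facts)
    if total == 0:
        return 0.0
    return sum(f.importance for f in omitted) / total


if __name__ == "__main__":
    facts = [
        Fact("chest pain", 1.0),
        Fact("smoker", 0.5),
        Fact("likes tea", 0.0),
    ]
    summary = "Patient reports chest pain for two days."
    omitted = find_omitted(facts, summary)
    print([f.text for f in omitted])
    print(weighted_omission_score(facts, omitted))
```

Weighting omissions by importance means that dropping a clinically irrelevant fact costs nothing, while dropping a fact that would change the diagnosis is penalized. Swapping `substring_covered` for a model-based check only requires passing a different `is_covered` callable.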