Patients often struggle to understand their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries from doctors' notes, and we study how the training data affects the faithfulness and quality of the generated summaries. To this end, we develop a rigorous labeling protocol for hallucinations and have two medical experts annotate 100 real-world summaries and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations for Llama 2, from 2.60 to 1.55 per summary, while preserving relevant information. The effect persists but is much smaller for GPT-4 when prompted with five examples (0.70 to 0.40 hallucinations per summary). We also conduct a qualitative evaluation using hallucination-free and improved training data; GPT-4 shows very good results even in the zero-shot setting. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which yields promising results.