SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization

Large Language Models (LLMs) such as GPT and Llama have achieved strong results on summarization tasks but still produce factual inaccuracies, a critical issue in clinical NLP applications, where errors can have serious consequences. To counter the high cost and limited availability of expert-annotated data for factual alignment, this study introduces a pipeline that uses GPT variants with more than 100B parameters, such as GPT-3.5 and GPT-4, as synthetic experts that generate high-quality synthetic feedback aimed at improving factual consistency in clinical note summarization. Our research focuses on edit feedback generated by these synthetic feedback experts without additional human annotation, mirroring and optimizing the practical scenario in which medical professionals refine AI system outputs. Although such 100B+ parameter GPT variants have demonstrated expertise on various clinical NLP tasks, such as answering Medical Licensing Examination questions, there is scant research on their capacity to act as synthetic feedback experts and deliver expert-level edit feedback that improves the generation quality of weaker (<10B parameter) LLMs such as GPT-2 (1.5B) and Llama 2 (7B) in the clinical domain. In this work, we therefore use 100B+ GPT variants as synthetic feedback experts whose expert-level edit feedback is used to reduce hallucinations and align weaker (<10B parameter) LLMs with medical facts through two distinct alignment algorithms (DPO and SALT), with the aim of narrowing the gap between AI-generated content and factual accuracy. This work highlights the substantial potential of LLM-based synthetic edits for improving clinical factual alignment.
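To make the alignment step more concrete, the following is a minimal sketch of how one edit-feedback example could be turned into a preference pair and scored with the standard DPO objective. It is not the paper's implementation: the `PreferencePair` structure, the field names, the choice of which summary counts as preferred, and the value `beta=0.1` are all illustrative assumptions, and the log-probabilities are placeholder numbers rather than model outputs. SALT is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example derived from edit feedback (hypothetical structure)."""
    prompt: str    # clinical note to be summarized
    chosen: str    # summary treated as more factual (assumption: the edited one)
    rejected: str  # summary treated as less factual (assumption: the unedited one)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single pair, given sequence log-probabilities.

    Each argument is the summed token log-probability of a summary under the
    policy being trained or under the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    pair = PreferencePair(
        prompt="<clinical note text>",
        chosen="<summary after synthetic edit>",
        rejected="<summary before synthetic edit>",
    )
    # Placeholder log-probabilities; in practice they come from the two models.
    print(dpo_loss(-42.0, -40.0, -41.0, -41.0))
```

When the policy and the reference assign identical log-probabilities to both summaries, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the relative likelihood of the preferred summary.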
