In the medical domain, many scenarios require large language models (LLMs) to generate long-form text. In particular, when a model answers patients' questions, its response must convey factually correct claims, which calls for an automated method to evaluate those claims. We therefore introduce MedLFQA, a benchmark dataset reconstructed from long-form question-answering datasets in the biomedical domain, and use it to enable automatic evaluation of factuality. We also propose OLAPH, a simple and novel framework that improves factuality through automatic evaluation. OLAPH iteratively trains LLMs to mitigate hallucinations using sampled predictions and preference optimization: at each iteration, we sample several predictions, set the highest-scoring one as the preferred response, and train the LLM to align with that response, which improves factuality. Even on evaluation metrics not used during training, LLMs trained with OLAPH show significant improvements in factuality. Our findings reveal that a 7B LLM trained with OLAPH can provide long answers comparable to medical experts' answers in terms of factuality. We believe that our work could shed light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are available at https://github.com/dmis-lab/OLAPH.
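The selection step described above (score the sampled predictions, keep the highest-scoring one as the preferred response) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the substring-based scorer, the choice of the lowest-scoring sample as the rejected response, and the example sentences are all assumptions made for illustration. Sampling from the model and the preference-optimization update itself are omitted.

```python
from typing import Callable, List, Tuple


def toy_factuality_score(response: str, reference_claims: List[str]) -> float:
    """Hypothetical scorer: fraction of reference claims found verbatim in the response."""
    if not reference_claims:
        return 0.0
    text = response.lower()
    hits = sum(1 for claim in reference_claims if claim.lower() in text)
    return hits / len(reference_claims)


def build_preference_pair(
    samples: List[str], score_fn: Callable[[str], float]
) -> Tuple[str, str]:
    """Return (preferred, rejected) from a list of sampled responses.

    The highest-scoring sample becomes the preferred response, as the abstract
    describes. Using the lowest-scoring sample as the rejected response is an
    assumption of this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two sampled responses")
    ranked = sorted(samples, key=score_fn, reverse=True)
    return ranked[0], ranked[-1]


# Made-up reference claims and sampled answers, for illustration only.
claims = ["lowers blood pressure", "dry cough"]
samples = [
    "Lisinopril lowers blood pressure.",
    "Lisinopril lowers blood pressure and may cause a dry cough.",
    "Lisinopril is an antibiotic.",
]
preferred, rejected = build_preference_pair(
    samples, lambda s: toy_factuality_score(s, claims)
)
```

In an iterative setup, pairs built this way would feed a preference-optimization step, after which new predictions are sampled from the updated model and the selection is repeated.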