In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed from long-form question-answering datasets in the biomedical domain. We use MedLFQA to facilitate cost-effective automatic evaluation of factuality. We also propose OLAPH, a simple and novel framework that leverages cost-effective, multifaceted automatic evaluation to construct a synthetic preference set and to answer questions in our preferred manner. Our framework trains LLMs step by step to reduce hallucinations and to include crucial medical claims. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant improvements in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long-form answers comparable to medical experts' answers in terms of factuality. We believe that our work sheds light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are publicly available.