In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed from long-form question-answering datasets in the biomedical domain. We use MedLFQA to facilitate cost-effective automatic evaluation of factuality. We also propose OLAPH, a simple and novel framework that leverages cost-effective, multifaceted automatic evaluation to construct a synthetic preference set and to answer questions in our preferred manner. Our framework trains LLMs step by step to reduce hallucinations and to include crucial medical claims. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant improvements in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long-form answers comparable to medical experts' answers in terms of factuality. We believe that our work sheds light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are publicly available.