Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Despite recent advances, however, adoption of these systems in clinical settings has been slow. One obstacle is the lack of question-answering datasets that reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs, and we evaluate several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating "ideal" QA pairs. Additionally, RealMedQA exhibits lower lexical similarity between questions and answers than BioASQ, which, as our results show, poses an additional challenge to the top two QA models. We release our code and our dataset publicly to encourage further research.